https://crawlfeeds.com/privacy_policy
https://creativecommons.org/publicdomain/zero/1.0/
"Movie Recommendation on the IMDB Dataset: A Journey into Machine Learning" is an exciting project focused on leveraging the IMDB Dataset for developing an advanced movie recommendation system. This project aims to explore the vast potential of machine learning techniques in providing personalized movie recommendations to users.
The IMDB Dataset, comprising a wealth of movie information including genres, ratings, and user reviews, serves as the foundation for this project. By harnessing the power of machine learning algorithms and data analysis, the project seeks to build a recommendation system that can accurately suggest movies tailored to each individual's preferences.
https://choosealicense.com/licenses/other/
Dataset Card for "imdb"
Dataset Summary
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
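The binary sentiment task this split supports can be illustrated with a deliberately tiny sketch — a multinomial Naive Bayes classifier over made-up reviews. The data and model below are purely illustrative; the benchmark itself is tackled with far stronger models:

```python
import math
from collections import Counter

# Toy stand-in for the 25k/25k review split: (review, label) pairs,
# 1 = positive, 0 = negative.
train = [
    ("a great and moving film", 1),
    ("wonderful acting and a great script", 1),
    ("terrible plot and awful acting", 0),
    ("an awful boring mess", 0),
]

def fit(data):
    """Count word frequencies per class (multinomial Naive Bayes)."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in data:
        counts[label].update(text.split())
    return counts

def predict(counts, text):
    """Pick the class with the higher add-one-smoothed log-likelihood."""
    vocab = set(counts[0]) | set(counts[1])
    scores = {}
    for label, c in counts.items():
        total = sum(c.values()) + len(vocab)
        scores[label] = sum(math.log((c[w] + 1) / total) for w in text.split())
    return max(scores, key=scores.get)

model = fit(train)
print(predict(model, "great film"))  # 1
print(predict(model, "awful plot"))  # 0
```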
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
I scraped data from IMDb to create a dataset of top-rated English movies. It includes movie names, release years, ratings, and user votes. The goal is to provide a valuable resource for movie enthusiasts and data analysts.
Sources: The data comes directly from IMDb, a popular movie information platform. I used web scraping to extract details from IMDb pages, ensuring the dataset is accurate and comprehensive.
Educational Intent: The entire data collection effort was driven by educational purposes, aiming to provide a curated dataset for analysis and exploration. Users are encouraged to leverage the dataset for educational and non-commercial purposes while being mindful of IMDb's terms of service.
Inspiration for Skill Improvement: This project helped me improve my web scraping skills, especially in navigating HTML structures and handling data extraction. I also honed my data cleaning and preprocessing abilities to ensure the dataset's quality. Analyzing and visualizing the data further improved my data analysis skills. Overall, this practical project enhanced my proficiency in handling real-world datasets.
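As a hedged illustration of the extraction step, the sketch below parses an IMDb-like HTML fragment with Python's standard-library HTMLParser. The markup and class names are hypothetical — real IMDb pages are structured differently, and scraping them is subject to IMDb's terms of service:

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration only; real IMDb pages differ.
PAGE = """
<li><span class="title">The Godfather (1972)</span>
    <span class="rating">9.2</span></li>
<li><span class="title">12 Angry Men (1957)</span>
    <span class="rating">9.0</span></li>
"""

class MovieParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.field = None   # class attribute of the span we are inside
        self.rows = []      # collected [title, rating] pairs

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.field = dict(attrs).get("class")

    def handle_data(self, data):
        if self.field == "title":
            self.rows.append([data.strip(), None])
        elif self.field == "rating" and self.rows:
            self.rows[-1][1] = float(data.strip())
        self.field = None

parser = MovieParser()
parser.feed(PAGE)
print(parser.rows)  # [['The Godfather (1972)', 9.2], ['12 Angry Men (1957)', 9.0]]
```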
AutoTrain Dataset for project: imdb-sentiment-analysis
Dataset Description
This dataset has been automatically processed by AutoTrain for project imdb-sentiment-analysis.
Languages
The BCP-47 code for the dataset's language is en.
Dataset Structure
Data Instances
A sample from this dataset looks as follows: [ { "text": "Me neither, but this flick is unfortunately one of those movies that are too bad to be good and… See the full description on the dataset page: https://huggingface.co/datasets/linktimecloud/autotrain-data-imdb-sentiment-analysis.
Dataset Card for AutoTrain Evaluator
This repository contains model predictions generated by AutoTrain for the following task and dataset:
Task: Binary Text Classification
Model: lvwerra/distilbert-imdb
Dataset: imdb
Config: plain_text
Split: test
To run new evaluation jobs, visit Hugging Face's automatic model evaluator.
Contributions
Thanks to @lvwerra for evaluating this model.
This dataset was created by Pawan Kumar
https://crawlfeeds.com/privacy_policy
Amazon Prime TV Shows and Movies Dataset offered by Crawl Feeds is an extensive resource containing over 92,000 records in JSON format. This dataset encompasses a wide array of data points, including links, titles, descriptions, release dates, genres, posters, streaming platforms, countries, number of seasons, content ratings, IMDb ratings, cast and crew details, unique identifiers, and scraping timestamps. Such comprehensive information is invaluable for researchers, data analysts, and developers aiming to conduct in-depth analyses, develop recommendation systems, or explore trends within Amazon Prime's content library.
For those interested in broader media datasets, Crawl Feeds also offers the Movies and TV Shows Dataset, which includes 118,000 records, and the IMDb Movie Details Dataset, comprising 250,000 records. These datasets provide extensive information across various platforms, facilitating comparative studies and cross-platform analyses.
Integrating these datasets into your projects can significantly enhance the depth and quality of your analyses, providing a robust foundation for exploring various facets of the entertainment industry. Whether you're developing a new application, conducting market research, or performing academic studies, these datasets serve as a valuable resource for gaining insights into the dynamic world of streaming media.
Explore the Amazon Prime TV Shows and Movies Dataset and other related datasets on Crawl Feeds to elevate your data-driven projects.
Context These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.
This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.
Content This dataset consists of the following files:
movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.
keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.
credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.
links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.
links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.
ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.
The Full MovieLens Dataset consisting of 26 million ratings and 750,000 tag applications from 270,000 users on all the 45,000 movies in this dataset can be accessed here
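A practical note on the stringified JSON columns mentioned above: the cells use Python literal syntax (single quotes), so `ast.literal_eval` is a safer parser than `json.loads`. A minimal sketch, with an illustrative cell value in the style of keywords.csv:

```python
import ast

# One cell as shipped: a Python-literal string, not strict JSON
# (single quotes), so ast.literal_eval is the safe parser.
cell = "[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}]"

keywords = [d["name"] for d in ast.literal_eval(cell)]
print(keywords)  # ['jealousy', 'toy']
```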
Acknowledgements This dataset is an ensemble of data collected from TMDB and GroupLens. The Movie Details, Credits and Keywords have been collected from the TMDB Open API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows. You can try it for yourself here.
The Movie Links and Ratings have been obtained from the Official GroupLens website. The files are a part of the dataset available here
Inspiration This dataset was assembled as part of my second Capstone Project for Springboard's Data Science Career Track. I wanted to perform an extensive EDA on Movie Data to narrate the history and the story of Cinema and use this metadata in combination with MovieLens ratings to build various types of Recommender Systems.
Both my notebooks are available as kernels with this dataset: The Story of Film and Movie Recommender Systems
Some of the things you can do with this dataset: Predicting movie revenue and/or movie success based on a certain metric. What movies tend to get higher vote counts and vote averages on TMDB? Building Content Based and Collaborative Filtering Based Recommendation Engines.
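For the recommendation-engine idea, a minimal content-based sketch: rank movies by Jaccard similarity of their genre sets. The titles and genres below are illustrative; a real system would draw on the richer metadata in movies_metadata.csv:

```python
# Illustrative catalog: title -> set of genres.
movies = {
    "Toy Story": {"Animation", "Comedy", "Family"},
    "Jumanji": {"Adventure", "Fantasy", "Family"},
    "Heat": {"Action", "Crime", "Thriller"},
}

def jaccard(a, b):
    """Overlap of two sets relative to their union."""
    return len(a & b) / len(a | b)

def recommend(liked, k=2):
    """Rank all other titles by genre similarity to the liked title."""
    scores = {t: jaccard(movies[liked], g)
              for t, g in movies.items() if t != liked}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("Toy Story"))  # ['Jumanji', 'Heat']
```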
The autoevaluate/autoeval-staging-eval-project-imdb-17316918-12425654 dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created to list all shows available on Amazon Prime streaming and to analyze the data for interesting facts. The data was acquired in May 2022 and covers titles available in the United States.
This dataset has two files containing the titles (titles.csv) and the cast (credits.csv) for the title.
This dataset contains more than 9,000 unique titles on Amazon Prime, with 15 columns of information, including:
- id: The title ID on JustWatch.
- title: The name of the title.
- show_type: TV show or movie.
- description: A brief description.
- release_year: The release year.
- age_certification: The age certification.
- runtime: The length of the episode (SHOW) or movie.
- genres: A list of genres.
- production_countries: A list of countries that produced the title.
- seasons: Number of seasons if it's a SHOW.
- imdb_id: The title ID on IMDB.
- imdb_score: Score on IMDB.
- imdb_votes: Votes on IMDB.
- tmdb_popularity: Popularity on TMDB.
- tmdb_score: Score on TMDB.
It also includes over 124,000 credits of actors and directors on Amazon Prime titles, with 5 columns of information:
- person_ID: The person ID on JustWatch.
- id: The title ID on JustWatch.
- name: The actor or director's name.
- character_name: The character name.
- role: ACTOR or DIRECTOR.
Some ideas for working with this dataset:
- Developing a content-based recommender system using the genres and/or descriptions.
- Identifying the main content available on the streaming service.
- Network analysis on the cast of the titles.
- Exploratory data analysis to find interesting insights.
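Since titles.csv and credits.csv share the JustWatch id column, a minimal join sketch looks like this. The rows below are made up and only mimic the described columns:

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-ins for titles.csv and credits.csv; the column
# names follow the description above, the rows are invented.
titles_csv = "id,title,show_type\nts1,Some Movie,MOVIE\nts2,Some Show,SHOW\n"
credits_csv = ("person_ID,id,name,character_name,role\n"
               "p1,ts1,Jane Doe,Lead,ACTOR\n"
               "p2,ts1,John Roe,,DIRECTOR\n")

# Index the credits by title id, then attach them to each title.
cast = defaultdict(list)
for row in csv.DictReader(io.StringIO(credits_csv)):
    cast[row["id"]].append((row["name"], row["role"]))

for row in csv.DictReader(io.StringIO(titles_csv)):
    print(row["title"], cast.get(row["id"], []))
```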
If you want to see how I obtained these data, please check my GitHub repository.
All data were collected from JustWatch.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cinema Context is an online MySQL database containing places, persons and companies involved in more than 100,000 film screenings since 1895. CC provides insight into the ‘DNA’ of Dutch film and cinema culture and is praised by film historians worldwide. With a DANS Small Data Project grant, this data set has been converted to a Linked Data format (RDF). This data deposit contains both the RDF data set and the script used to convert the MySQL database into RDF.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many television shows follow the “will they or won’t they” trope, where the dynamic between a pair of main characters constantly shifts between friendship and something more throughout the run of the series. This trope has persisted throughout the decades, and examples include Sam and Diane from the 1980s show Cheers and Jess and Nick from the 2010s show New Girl. In some cases, the audience may wait multiple seasons before a couple like this gets together, and some suspect that producers delay the moment to create suspense and keep viewers engaged. Events marking major romantic milestones, such as the pair’s first kiss, often change the trajectory of the plot, influence the number of viewers tuning into the show, and drive up episode ratings. In this project, we scrape viewer ratings from the Internet Movie Database (IMDb) for 150 popular couples from 125 television series and then model the plot shifts following episodes with romantic milestones using causal inference methods. Specifically, we construct an interrupted time series model, where the interruption is the episode in which each couple has their first kiss. From this model, we assess whether these interruptions are associated with changes in viewer ratings on average.
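The interrupted time series idea can be sketched as a segmented regression with a level-shift dummy at the milestone episode. The series below is synthetic and noise-free, purely to show the model form, not the project's actual estimates:

```python
import numpy as np

# Interrupted time series as segmented regression: episode rating as a
# linear trend plus a level shift at the first-kiss episode k.
t = np.arange(20)
k = 10
ratings = 7.0 + 0.02 * t + 0.5 * (t >= k)   # synthetic, noise-free series

# Design matrix: intercept, trend, post-interruption indicator.
X = np.column_stack([np.ones_like(t), t, (t >= k).astype(float)])
beta, *_ = np.linalg.lstsq(X, ratings, rcond=None)
print(beta)  # approximately [7.0, 0.02, 0.5]: baseline, trend, jump
```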
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created to list all shows available on Disney+ streaming and to analyze the data for interesting facts. The data was acquired in May 2022 and covers titles available in the United States.
This dataset has two files containing the titles (titles.csv) and the cast (credits.csv) for the title.
This dataset contains more than 1,500 unique titles on Disney+, with 15 columns of information, including:
- id: The title ID on JustWatch.
- title: The name of the title.
- show_type: TV show or movie.
- description: A brief description.
- release_year: The release year.
- age_certification: The age certification.
- runtime: The length of the episode (SHOW) or movie.
- genres: A list of genres.
- production_countries: A list of countries that produced the title.
- seasons: Number of seasons if it's a SHOW.
- imdb_id: The title ID on IMDB.
- imdb_score: Score on IMDB.
- imdb_votes: Votes on IMDB.
- tmdb_popularity: Popularity on TMDB.
- tmdb_score: Score on TMDB.
It also includes over 26,000 credits of actors and directors on Disney+ titles, with 5 columns of information:
- person_ID: The person ID on JustWatch.
- id: The title ID on JustWatch.
- name: The actor or director's name.
- character_name: The character name.
- role: ACTOR or DIRECTOR.
Some ideas for working with this dataset:
- Developing a content-based recommender system using the genres and/or descriptions.
- Identifying the main content available on the streaming service.
- Network analysis on the cast of the titles.
- Exploratory data analysis to find interesting insights.
If you want to see how I obtained these data, please check my GitHub repository.
All data were collected from JustWatch.
I love movies.
I tend to avoid standardized Marvel/Transformers products, and prefer a mix of classic Hollywood golden age and obscure Polish artsy movies. Throw in an occasional Japanese zombie-slasher giallo as an alibi. Good movies don't exist without bad movies.
On average I watch 200+ movies each year, with peaks at more than 500. Nine years ago I started logging my movies to avoid watching the same one twice, and also to assign scores. Over the years, this gave me a couple of insights into my viewing habits, but nothing more than what a tenth-grader would learn at school.
I've recently subscribed to Netflix, and it pains me to see how inefficient recommendation systems are for people like me, who mostly swear by "la politique des auteurs". The term was coined by the famous French New Wave movie critic André Bazin, and means that the quality of a movie is essentially linked to the director and his capacity to execute his vision with his crew. We could debate whether that depends on the movie production pipeline, but let's not for now. Practically, what it means is that I essentially watch movies from directors who made films I've liked.
I suspect Netflix calibrates its recommendation models on the way the "average Joe" chooses a movie. A few months ago I read a survey-based study showing that people choose a movie mostly by genre (55%), then by leading actors (45%); director and release date were far behind, at around 10% each. That's not surprising, since most people I know don't care who the director is. Lots of US blockbusters don't even mention it on the movie poster. I am aware that collaborative filtering is based on user proximity, which I believe decreases (or even eliminates) the need to characterize a movie. So here I'm more interested in content-based filtering, which is based on product proximity, for several reasons:
- Users' tastes are not easily accessible. They are, after all, Netflix's treasure chest.
- The movie selection on Netflix is so poor for someone who likes auteur films that it wouldn't help.
- Modeling a movie's intrinsic qualities is a nice challenge.
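A minimal sketch of the director-first filter described above — surface unseen films by directors whose previous films scored highly. All titles and scores below are illustrative:

```python
from collections import defaultdict

# "La politique des auteurs" as a filter: recommend unseen films by
# directors whose films I rated highly. Illustrative data only.
catalog = [
    ("Playtime", "Jacques Tati"),
    ("Trafic", "Jacques Tati"),
    ("Stalker", "Andrei Tarkovsky"),
    ("Generic Blockbuster 7", "Committee"),
]
my_scores = {"Playtime": 9, "Stalker": 8, "Generic Blockbuster 7": 3}

# Average my scores per director over the films already seen.
seen = defaultdict(list)
for film, director in catalog:
    if film in my_scores:
        seen[director].append(my_scores[film])
liked = {d for d, s in seen.items() if sum(s) / len(s) >= 7}

# Recommend unseen films from liked directors.
picks = [f for f, d in catalog if d in liked and f not in my_scores]
print(picks)  # ['Trafic']
```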
Enough.
"*The secret of getting ahead is getting started*" (Mark Twain)
[Image: network graph — https://img11.hostingpics.net/pics/117765networkgraph.png]
The primary source is www.themoviedb.org. If you watch obscure artsy Romanian homemade movies you may find only 95% of your movies referenced... but for anyone else it should be in the 98%+ range.
Movie details come from the www.themoviedb.org API (movies/details).
Movie crew & casting come from the www.themoviedb.org API (movies/credits).
Both can be joined by id.
They cover all ~350k movies, from the end of the 19th century up to August 2017. If you remove short films from IMDb, you get a similar number of movies.
I uploaded the program to retrieve incremental movie details to GitHub: https://github.com/stephanerappeneau/scienceofmovies/tree/master/PycharmProjects/GetAllMovies (you'll need a dev API key from themoviedb.org, though).
I have tried various supervised (decision tree) and unsupervised (clustering, NLP) approaches described in the discussions; source code is on GitHub: https://github.com/stephanerappeneau/scienceofmovies
As a bonus, I've uploaded the bio summaries of the top 500 critically acclaimed directors from Wikipedia, for some interesting NLTK analysis.
Here is an overview of the sources I've tried:
• Imdb.com free CSV dumps (ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/temporaryaccess/) are badly documented, incomplete, loosely structured and impossible to join/merge. There's an API hosted on Amazon Web Services at 1€ per 100,000 requests; with around 1 million movies it could become expensive, and the features are bare. So I searched for other sources.
• www.themoviedb.org is based on crowdsourcing and has an excellent API, limited to 40 requests every 10 seconds. That's quite generous, well documented, and enough to sweep the 450,000 movies in a few days. For my purpose, data quality is not significantly worse than IMDb's, and as the IMDb key is also included, there's always the possibility of completing my dataset later (I actually did).
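A hedged sketch of staying under that 40-requests-per-10-seconds limit with a sliding-window throttle; the fetch function here is a stub standing in for real themoviedb.org API calls:

```python
import time
from collections import deque

def rate_limited(fetch, ids, max_calls=40, per_seconds=10.0):
    """Call fetch(id) for each id, staying under max_calls per window."""
    stamps = deque()   # monotonic timestamps of recent calls
    results = []
    for movie_id in ids:
        now = time.monotonic()
        # Drop timestamps that have fallen out of the window.
        while stamps and now - stamps[0] > per_seconds:
            stamps.popleft()
        # If the window is full, wait until the oldest call expires.
        if len(stamps) >= max_calls:
            time.sleep(per_seconds - (now - stamps[0]))
        stamps.append(time.monotonic())
        results.append(fetch(movie_id))
    return results

# A stub fetch stands in for a real /movie/{id} request here.
print(rate_limited(lambda i: {"id": i}, [550, 551], max_calls=2))
```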
• www.Boxofficemojo.com has some interesting budget/revenue figures (which are sorely lacking in both IMDb and TMDB), but it actually tracks only a few thousand movies, mainly blockbusters. There are other professional sources used by the film industry to get better predictive/marketing insights, but those are beyond my reach for this experiment.
• www.wikipedia.com is an interesting source with no real cap on API calls, but it requires a bit of web scraping, and for movies and directors the layout and quality vary a lot. I suspected it would take a lot of work to get insights, so I put this source at lower priority.
• www.google.com will ban you after a few minutes of web scraping, because their job is to scrape data from others and then sell it.
• It's worth mentioning that there are a few dumps of anonymized Netflix user ratings on Kaggle, because Netflix organised a few competitions to improve its recommendation models: https://www.kaggle.com/netflix-inc/netflix-prize-data
• Online databases are largely white-Anglo-Saxon-centric, meaning Bollywood (India is the second-biggest producer of movies) is mostly absent from datasets. I'm fine with that, as it's not my cup of tea, plus I lack domain knowledge. The sheer number of Indian movies would probably skew my results anyway (I don't want too many martial-arts musicals in my recommendations ;-)). I have, however, tremendous respect for the Indian movie industry, so I'd love to collaborate with an Indian cinephile!
[Image: Westerns — https://img11.hostingpics.net/pics/340226westerns.png]
Starting from there, I had multiple problem statements for both supervised and unsupervised machine learning:
Can I program a tailored recommendation system based on my own criteria?
What are the characteristics of the movies/directors I like the most?
What is the probability that I will like my next movie?
Can I find the data?
One of the objectives of sharing my work here is to find cinephile data scientists who might be interested and, hopefully, contribute or share insights :) Other interesting leads: use the tagline for NLP/clustering/genre guessing, leverage budget/revenue, link with other data sources using the normalized IMDb title, etc.
[Image: correlation matrix — https://img11.hostingpics.net/pics/977004matrice.png]
I graduated from a French engineering school, majoring in artificial intelligence, but that was 17 years ago, right in the middle of the A.I. winter. Like a lot of white male rocket scientists, I ended up in one of the leading European investment banks, quickly abandoning IT development to specialize in trading/risk project management and internal politics. My recent appointment to the Data Office made me aware of recent breakthroughs in data science, and I thought that developing a side project would be an excellent occasion to learn something new. Plus, it would give me the much-needed credibility that decision makers too often lack when it comes to data science.
I've worked on some of the features with Cédric Paternotte, a friend of mine who is a professor of philosophy of science at the Sorbonne. Working with someone from a different background seemed a good idea for motivation, creativity and rigor.
Kudos to the www.themoviedb.org and www.wikipedia.com sites, which really have a great attitude towards open data. This is typically NOT the case with modern big-data companies, which mostly keep data to themselves to try to monetize it. Such a huge contrast with the IMDb or Instagram APIs, which generously let you grab your last 3 comments at a miserable rate. Even if 15 years ago this seemed a mandatory path to get services for free, I predict that one day governments will need to break this data monopoly.
[Disclaimer: I apologize in advance for my English (I'm French ^-^), for any bad code I've written (there are probably hundreds of ways to do it better and faster), for any pseudo-scientific assumption I've made (I'm slowly getting back into statistics and lack senior guidance; one day I regress a non-stationary time series and the day after I discover I shouldn't have), and for any incorrect use of machine-learning models.]
[Image: powered by themoviedb.org — https://img11.hostingpics.net/pics/898068408x161poweredbyrectanglegreen.png]
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
(see https://tblock.github.io/10kGNAD/ for the original dataset page)
This page introduces the 10k German News Articles Dataset (10kGNAD), a German topic classification dataset. The 10kGNAD is based on the One Million Posts Corpus and is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. You can download the dataset here.
English text classification datasets are common. Examples are the big AG News, the class-rich 20 Newsgroups and the large-scale DBpedia ontology datasets for topic classification, and the commonly used IMDb and Yelp datasets for sentiment analysis. Non-English datasets, especially German ones, are less common. There is a collection of sentiment analysis datasets assembled by the Interest Group on German Sentiment Analysis. However, to my knowledge, no German topic classification dataset is available to the public.
Due to grammatical differences between English and German, a classifier might be effective on an English dataset but not as effective on a German one. German is more highly inflected, and long compound words are quite common compared to English. One would need to evaluate a classifier on multiple German datasets to get a sense of its effectiveness.
The 10kGNAD dataset is intended to solve part of this problem as the first German topic classification dataset. It consists of 10,273 German-language news articles from an Austrian online newspaper, categorized into nine topics. These articles are a previously unused part of the One Million Posts Corpus.
In the One Million Posts Corpus each article has a topic path, for example Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise. The 10kGNAD uses the second part of the topic path, here Wirtschaft, as the class label. As a result, the dataset can be used for multi-class classification.
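The label derivation just described is a one-liner in Python, shown here on the example path from the corpus:

```python
# The class label is the second segment of the article's topic path.
path = "Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise"
label = path.split("/")[1]
print(label)  # Wirtschaft
```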
I created and used this dataset in my thesis to train and evaluate four text classifiers on the German language. By publishing the dataset I hope to support the development of tools and models for German. Additionally, this dataset can serve as a benchmark for German topic classification.
As in most real-world datasets, the class distribution of the 10kGNAD is not balanced. The biggest class, Web, consists of 1,678 articles, while the smallest class, Kultur, contains only 539. However, articles from the Web class have on average the fewest words, while articles from the Kultur class have the second-most words.
I propose a stratified split of 10% for testing, with the remaining articles for training.
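A minimal pure-Python sketch of such a stratified split; the class names are borrowed from the dataset, but the sample texts and counts are placeholders:

```python
import random
from collections import defaultdict

def stratified_split(samples, test_frac=0.10, seed=42):
    """Split (text, label) pairs so each class keeps ~test_frac in test."""
    by_label = defaultdict(list)
    for item in samples:
        by_label[item[1]].append(item)
    rng = random.Random(seed)
    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        cut = max(1, round(len(items) * test_frac))  # per-class test size
        test.extend(items[:cut])
        train.extend(items[cut:])
    return train, test

# Placeholder articles with the dataset's largest/smallest class names.
data = [(f"article {i}", "Web") for i in range(20)] + \
       [(f"article {i}", "Kultur") for i in range(10)]
train, test = stratified_split(data)
print(len(train), len(test))  # 27 3
```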
To use the dataset as a benchmark, please use the train.csv and test.csv files located in the project root.
Python scripts to extract the articles and split them into a train set and a test set are available in the code directory of this project. Make sure to install the requirements. The original corpus.sqlite3 is required to extract the articles (download here (compressed) or here (uncompressed)).
This dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please consider citing the authors of the One Million Posts Corpus if you use the dataset.
Dataset Card for "HebrewMetaphors"
Dataset Summary
A common dataset for the text classification task is IMDb, the Large Movie Review Dataset, a dataset for binary sentiment classification. The first step in our project was to create a Hebrew dataset with an IMDb-like structure, differing in that, in addition to the sentences, it also contains verb names and a classification of whether the verb is literal or metaphorical in the given sentence. Using an… See the full description on the dataset page: https://huggingface.co/datasets/tdklab/HebrewMetaphors.