Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset was created by Ritik Kumar
Released under Apache 2.0
https://academictorrents.com/nolicensespecified
This is the official data set used in the Netflix Prize competition. The data consists of about 100 million movie ratings, and the goal is to predict missing entries in the movie-user rating matrix.

| Attribute | Value |
|---|---|
| Data Set Characteristics | Multivariate, Time-Series |
| Attribute Characteristics | Integer |
| Associated Tasks | Clustering, Recommender-Systems |
| Number of Instances | 100480507 |
| Number of Attributes | 17770 |
| Missing Values? | Yes |
| Area | N/A |

# Data Set Information

This dataset was constructed to support participants in the Netflix Prize. There are over 480,000 customers in the dataset, each identified by a unique integer id. The title and release year for each movie are also provided. There are over 17,000 movies in the dataset, each identified by
The Netflix Prize data is one of the most popular datasets available today for OTT recommendation. The Netflix Prize dataset contains title, user ID, rating, and date of rating as its only attributes for recommendation. We extend the Netflix Prize dataset by scraping IMDb data about its titles. Any copyright in the scraped data belongs to its respective owners.
The dataset contains information on approximately 9,000 movies and TV shows from the Netflix Prize dataset, such as duration, cast and crew, genre, and languages. Columns that hold multiple values per row store them as arrays. Please use the .json file to access the dataset to avoid string-related errors.
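To illustrate why the .json file is the safer choice, here is a minimal sketch. The record and its field names (`title`, `genre`) are made up for illustration and are not the dataset's actual schema. An array survives a JSON round trip as a list, whereas a CSV cell comes back as one string that still needs parsing.

```python
import csv
import io
import json

# Hypothetical record; field names are illustrative, not the dataset's schema.
record = {"title": "Example Movie", "genre": ["Drama", "Comedy"]}

# JSON keeps the array as a real list.
from_json = json.loads(json.dumps(record))

# CSV has no array type: the writer stores str(list) in a single cell.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "genre"])
writer.writeheader()
writer.writerow(record)
buf.seek(0)
from_csv = next(csv.DictReader(buf))

print(type(from_json["genre"]).__name__, from_json["genre"])
print(type(from_csv["genre"]).__name__, from_csv["genre"])
```

The CSV value can usually be recovered with extra parsing, but quotes or commas inside names are where string-related errors tend to creep in.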
Could you build a hybrid recommendation system by combining our dataset with the Netflix Prize dataset?
Some entries in imdb.csv and imdb.json describe movies that share a title with a movie in the Netflix Prize dataset but were made after 2005, the last year covered by the Netflix Prize dataset. This has been corrected in imdb_processed.csv and imdb_processed.json. Please use the processed files for tasks specific to the Netflix Prize dataset.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
A movie dataset used for a Netflix recommendation system engine
This dataset was created by Miraj Shah
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Check out the project GitHub for even more details.
During GHW: February 2025, I wanted the opportunity to experiment more with the CreateML tools built into Xcode to create a recommendation system. I had previously used CreateML to make a learning/test project, but nothing quite on this scale.
Thanks to others' recommendations and scouring Kaggle, I was introduced to the Netflix Prize Data dataset, which was used for a Netflix-run contest to improve movie recommendation systems. In order to feed this dataset into CreateML, a lot of cleaning and reorganization had to be completed. CreateML requires datasets to look a specific way, with header names, userIDs, titles, and ratings. It also requires the train and test datasets to be split beforehand, outside the tool.
The merge.py script was used alongside the data provided in Netflix Prize Data to better organize this dataset for learning purposes. The script and 2 final data sets were uploaded onto this page.
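As a rough idea of what such a reorganization involves (this is a sketch, not the author's merge.py), the snippet below assumes the raw Netflix Prize layout: a `movieID:` line followed by `customerID,rating,date` lines. It flattens that into rows with a header and splits them into train and test sets. The function names and the `movieID` column are illustrative; mapping IDs to titles would need the separate titles file.

```python
import csv
import io
import random

def parse_netflix_prize(lines):
    """Yield (user_id, movie_id, rating) from 'movieID:' blocks of 'user,rating,date' lines."""
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            movie_id = int(line[:-1])  # a new movie block starts here
        else:
            user_id, rating, _date = line.split(",")
            yield int(user_id), movie_id, int(rating)

def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle reproducibly and hold out roughly test_fraction of the rows (at least one)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = len(rows) - max(1, round(len(rows) * test_fraction))
    return rows[:cut], rows[cut:]

def to_csv(rows):
    """Serialize rows with the header line a tabular tool would expect."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["userID", "movieID", "rating"])
    writer.writerows(rows)
    return buf.getvalue()

# Illustrative sample in the raw layout.
sample = """1:
1488844,3,2005-09-06
822109,5,2005-05-13
2:
2059652,4,2005-09-05
1666394,3,2005-04-19
885013,4,2005-10-19
""".splitlines()

rows = list(parse_netflix_prize(sample))
train, test = split_rows(rows)
print(to_csv(train))
```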
The CreateML recommender will be uploaded once training is completed, alongside a completed prototype of the SwiftUI application which uses the recommender.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset provides a detailed list and metadata for approximately 7,000 TV shows and movies available on Netflix as of June 2021. Sourced from the IMDB website, it offers insights into content characteristics, popularity, and categorisation, making it suitable for various analytical and machine learning applications.
The dataset is typically provided as a CSV file named netflix_list.csv. It contains approximately 7,000 records, with 7,008 unique identifiers for shows and movies. This dataset is listed as version 1.0 and was added to the platform on 11 June 2025.
This dataset is ideally suited for developing recommender systems, performing natural language processing (NLP) tasks on plot summaries, and conducting market analysis of entertainment content. It can be used to explore trends in movie and TV show production, analyse viewer preferences, and facilitate content categorisation efforts.
The dataset offers global coverage, with information on content originating from various countries. The startYear of content spans from 1932 to 2022, with the majority of content released between 2004 and 2022. The endYear ranges from 1969 to 2022, with most data concentrated from 2011 to 2022. It includes age certification information and an indicator for adult content, allowing for demographic considerations related to content suitability.
License: CC0
This dataset is valuable for data scientists and machine learning engineers working on content recommendation engines or text analysis projects. It is also beneficial for researchers studying media consumption patterns and entertainment industry analysts interested in exploring the Netflix content catalogue programmatically.
Original Data Source: Netflix Movie and TV Shows (June 2021)
https://cubig.ai/store/terms-of-service
1) Data Introduction
• The Netflix Movies and TV Shows Dataset contains various metadata on movies and TV shows available on Netflix.
• Key features include the title, director, cast, country, date added, release year, rating, genre, and total duration (in minutes or number of seasons) of the content.

2) Data Utilization
(1) Characteristics of the Netflix Movies and TV Shows Dataset
• This dataset helps in understanding content trends and markets, as well as analyzing global preferences and changing consumer tastes.
• It is useful for analyzing the characteristics of content available in different countries, including genre, cast, director, and more.

(2) Applications of the Netflix Movies and TV Shows Dataset
• Content Analysis: Analyze how Netflix's content is distributed, and understand preferences based on genre or country.
• Recommendation System Development: Develop algorithms that recommend similar content based on user viewing patterns.
• Market Analysis: Identify which content is popular in different countries and analyze if Netflix focuses more on specific countries or genres.
Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. This is the dataset that was used in that competition.
Full data: https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data
This dataset was created by Gaurav Dutta
It contains the following files:
Attribute Information
1. show_id: Unique ID for every movie / TV show
2. type: Identifier - a movie or TV show
3. title: Title of the movie / TV show
4. director: Director of the movie
5. cast: Actors involved in the movie / show
6. country: Country where the movie / show was produced
7. date_added: Date it was added on Netflix
8. release_year: Actual release year of the movie / show
9. rating: TV rating of the movie / show
10. duration: Total duration - in minutes or number of seasons
11. listed_in: Genre
12. description: The summary description
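Because `duration` mixes two units, it usually needs splitting before any numeric analysis. The sketch below assumes the cells look like `90 min` or `2 Seasons`; that formatting is an assumption and should be checked against the actual file. `parse_duration` is a hypothetical helper, not part of the dataset.

```python
import re

def parse_duration(text):
    """Return (value, unit), where unit is 'min' for movies or 'season' for shows."""
    match = re.fullmatch(r"\s*(\d+)\s*(min|seasons?)\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    value = int(match.group(1))
    unit = "min" if match.group(2).lower() == "min" else "season"
    return value, unit

for cell in ["90 min", "2 Seasons", "1 Season"]:
    print(cell, "->", parse_duration(cell))
```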
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
The Netflix Prize was a competition devised by Netflix to improve the accuracy of its recommendation system. To facilitate this, Netflix released real ratings about movies from the users (voters) of the system. Any set of movies can be transformed into an election via a process outlined by Mattei, Forshee, and Goldsmith. This data set includes all 5-candidate elections with at least 350 voters generated by this process from 300 randomly chosen movies. Extending beyond prior work by Mattei et al., we allow for weak preferences, i.e., a voter is indifferent between a set of movies if they assign the same rating to each of them. Thus, there are 541 possibilities to rank a given set of five movies. The archive is gzip compressed and includes 165,672 elections in PrefLib.org's TOC file format (Orders with Ties - Complete List).
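The figure of 541 is the number of ways to rank five items when ties are allowed (the ordered Bell, or Fubini, number for n = 5). The sketch below checks that count. It also shows one plausible way a voter's ratings map to a weak order; the helper names are illustrative and this is not the authors' conversion code.

```python
from math import comb

def weak_orders(n):
    """Number of rankings of n items when ties are allowed (ordered Bell number)."""
    counts = [1]  # one way to rank zero items
    for m in range(1, n + 1):
        # Pick the k items tied for first place, then rank the remaining m - k.
        counts.append(sum(comb(m, k) * counts[m - k] for k in range(1, m + 1)))
    return counts[n]

def ratings_to_weak_order(ratings):
    """Group movies into tiers by rating, best tier first; equal ratings share a tier."""
    tiers = {}
    for movie, stars in ratings.items():
        tiers.setdefault(stars, []).append(movie)
    return [sorted(tiers[s]) for s in sorted(tiers, reverse=True)]

print(weak_orders(5))  # 541
print(ratings_to_weak_order({"A": 5, "B": 3, "C": 5, "D": 1, "E": 3}))
# [['A', 'C'], ['B', 'E'], ['D']]
```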
This dataset was created by Elemento
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
ABSTRACT Matrix factorization (MF) has evolved as one of the better practices for handling sparse data in the field of recommender systems. Funk singular value decomposition (SVD) is a variant of MF that became a state-of-the-art method after helping to win the Netflix Prize competition. The method is widely used, with modifications, in present-day research on recommender systems. With data points liable to grow at very high velocity, it is prudent to devise newer methods that can handle such data as accurately as, and more efficiently than, Funk-SVD in the context of recommender systems. In view of the growing data points, I propose a latent factor model that caters to both accuracy and efficiency by reducing the number of latent features of either users or items, making it less complex than Funk-SVD, where the latent features of users and items are equal in number and often larger. A comprehensive empirical evaluation on two publicly available datasets, Amazon and ml-100k, reveals comparable accuracy and lower complexity for the proposed methods relative to Funk-SVD.
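For readers unfamiliar with the baseline, here is a minimal sketch of plain Funk-style SVD trained by stochastic gradient descent. It is not the model proposed in the abstract. The toy ratings, the hyperparameters, and the use of a global mean as the only bias term are all assumptions made for illustration.

```python
import random
from math import sqrt

def predict(mu, P, Q, u, i):
    """Predicted rating: global mean plus the dot product of user and item factors."""
    return mu + sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def funk_svd(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=500, seed=0):
    """Fit r_ui ~ mu + p_u . q_i by SGD over observed (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    mu = sum(r for _, _, r in ratings) / len(ratings)
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(mu, P, Q, u, i)
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mu, P, Q

def rmse(ratings, mu, P, Q):
    return sqrt(sum((r - predict(mu, P, Q, u, i)) ** 2 for u, i, r in ratings) / len(ratings))

# Toy data: users 0-1 like items 0-1, users 2-3 like items 2-3; four cells are unobserved.
ratings = [(0, 0, 5), (0, 2, 1), (0, 3, 1),
           (1, 0, 5), (1, 1, 5), (1, 3, 1),
           (2, 0, 1), (2, 1, 1), (2, 2, 5),
           (3, 1, 1), (3, 2, 5), (3, 3, 5)]
mu, P, Q = funk_svd(ratings, n_users=4, n_items=4)
print("training RMSE:", round(rmse(ratings, mu, P, Q), 3))
print("prediction for unobserved (user 0, item 1):", round(predict(mu, P, Q, 0, 1), 2))
```

Here both users and items share the same number of latent features `k`, which is the property the abstract contrasts its proposal against.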
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This is a reduced dataset drawn from Netflix's much larger movie ratings database, for use in collaborative filtering, recommendation systems, and related applications.
Any particular user has rated only a fraction of the movies, so the data matrix is only partially filled. The goal here is to fill all the remaining entries of the matrix, and then compare with the complete test matrix.
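The evaluation loop described here can be sketched with a deliberately naive baseline: fill each missing entry with the movie's mean rating, then measure the root-mean-square error against the complete matrix on the entries that were missing. The tiny matrices below are made up, and real submissions would replace `fill_with_item_means` with an actual model.

```python
from math import sqrt

def fill_with_item_means(matrix):
    """Replace None entries with the column (movie) mean, or the global mean for empty columns."""
    known = [v for row in matrix for v in row if v is not None]
    global_mean = sum(known) / len(known)
    col_means = []
    for j in range(len(matrix[0])):
        col = [row[j] for row in matrix if row[j] is not None]
        col_means.append(sum(col) / len(col) if col else global_mean)
    return [[v if v is not None else col_means[j] for j, v in enumerate(row)]
            for row in matrix]

def rmse_on_missing(partial, filled, complete):
    """RMSE over only the cells that were missing in the partial matrix."""
    errs = [(filled[i][j] - complete[i][j]) ** 2
            for i, row in enumerate(partial)
            for j, v in enumerate(row) if v is None]
    return sqrt(sum(errs) / len(errs))

# Rows are users, columns are movies; None marks an unrated movie.
partial = [[5, None, 1],
           [4, 2, None],
           [None, 3, 2]]
complete = [[5, 3, 1],
            [4, 2, 1],
            [5, 3, 2]]

filled = fill_with_item_means(partial)
print(rmse_on_missing(partial, filled, complete))  # 0.5
```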
https://creativecommons.org/publicdomain/zero/1.0/
This data set was created to list all shows available on Amazon Prime streaming, and analyze the data to find interesting facts. This data was acquired in May 2022 containing data available in the United States.
This dataset has two files containing the titles (titles.csv) and the cast (credits.csv) for the title.
This dataset contains +9k unique titles on Amazon Prime with 15 columns containing their information, including:
- id: The title ID on JustWatch.
- title: The name of the title.
- show_type: TV show or movie.
- description: A brief description.
- release_year: The release year.
- age_certification: The age certification.
- runtime: The length of the episode (SHOW) or movie.
- genres: A list of genres.
- production_countries: A list of countries that produced the title.
- seasons: Number of seasons if it's a SHOW.
- imdb_id: The title ID on IMDB.
- imdb_score: Score on IMDB.
- imdb_votes: Votes on IMDB.
- tmdb_popularity: Popularity on TMDB.
- tmdb_score: Score on TMDB.
And over 124k credits of actors and directors on Amazon Prime titles with 5 columns containing their information:
- person_ID: The person ID on JustWatch.
- id: The title ID on JustWatch.
- name: The actor or director's name.
- character_name: The character name.
- role: ACTOR or DIRECTOR.
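Since both files carry the JustWatch title `id`, they can be joined on it. The sketch below uses made-up rows with a subset of the documented columns and only the standard library; `directors_of` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Made-up rows following the documented column names (subset of columns).
titles_csv = """id,title,show_type,release_year
tm1,Example Film,MOVIE,2019
ts2,Example Series,SHOW,2021
"""
credits_csv = """person_ID,id,name,character_name,role
101,tm1,Jane Doe,Alice,ACTOR
102,tm1,John Roe,,DIRECTOR
103,ts2,Jane Doe,Bob,ACTOR
"""

# Index titles by id, and group credit rows under the title they belong to.
titles = {row["id"]: row for row in csv.DictReader(io.StringIO(titles_csv))}
credits_by_title = defaultdict(list)
for row in csv.DictReader(io.StringIO(credits_csv)):
    credits_by_title[row["id"]].append(row)

def directors_of(title_id):
    """Names of everyone credited as DIRECTOR on the given title."""
    return [c["name"] for c in credits_by_title[title_id] if c["role"] == "DIRECTOR"]

for title_id, title in titles.items():
    people = [c["name"] for c in credits_by_title[title_id]]
    print(title["title"], people, "directed by", directors_of(title_id))
```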
- Developing a content-based recommender system using the genres and/or descriptions.
- Identifying the main content available on the streaming service.
- Network analysis on the cast of the titles.
- Exploratory data analysis to find interesting insights.
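As a starting point for the content-based recommender mentioned in the list above, one simple option is to score titles by the Jaccard overlap of their genre lists. The catalogue below is invented, and Jaccard similarity is just one reasonable choice among many (TF-IDF over descriptions would be another).

```python
def jaccard(a, b):
    """Share of genres two titles have in common, between 0.0 and 1.0."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target, catalogue, top_n=2):
    """Titles most similar to target by genre overlap; ties broken alphabetically."""
    scores = [(jaccard(catalogue[target], genres), title)
              for title, genres in catalogue.items() if title != target]
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [title for _score, title in ranked[:top_n]]

# Invented catalogue: title -> list of genres.
catalogue = {
    "Film A": ["drama", "romance"],
    "Film B": ["drama", "romance", "comedy"],
    "Film C": ["action", "scifi"],
    "Film D": ["drama"],
}

print(recommend("Film A", catalogue))  # ['Film B', 'Film D']
```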
If you want to see how I obtained these data, please check my GitHub repository.
All data were collected from JustWatch.
https://creativecommons.org/publicdomain/zero/1.0/
This data set was created to list all shows available on Disney+ streaming, and analyze the data to find interesting facts. This data was acquired in May 2022 containing data available in the United States.
This dataset has two files containing the titles (titles.csv) and the cast (credits.csv) for the title.
This dataset contains +1500 unique titles on Disney+ with 15 columns containing their information, including:
- id: The title ID on JustWatch.
- title: The name of the title.
- show_type: TV show or movie.
- description: A brief description.
- release_year: The release year.
- age_certification: The age certification.
- runtime: The length of the episode (SHOW) or movie.
- genres: A list of genres.
- production_countries: A list of countries that produced the title.
- seasons: Number of seasons if it's a SHOW.
- imdb_id: The title ID on IMDB.
- imdb_score: Score on IMDB.
- imdb_votes: Votes on IMDB.
- tmdb_popularity: Popularity on TMDB.
- tmdb_score: Score on TMDB.
And over 26k credits of actors and directors on Disney+ titles with 5 columns containing their information, including:
- person_ID: The person ID on JustWatch.
- id: The title ID on JustWatch.
- name: The actor or director's name.
- character_name: The character name.
- role: ACTOR or DIRECTOR.
- Developing a content-based recommender system using the genres and/or descriptions.
- Identifying the main content available on the streaming service.
- Network analysis on the cast of the titles.
- Exploratory data analysis to find interesting insights.
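The cast network idea above can be prototyped by counting how often two actors share a title; each pair becomes a weighted edge. The credit tuples below are invented, and a real analysis would read them from credits.csv and probably hand the edges to a graph library.

```python
from collections import Counter
from itertools import combinations

def costar_edges(credits):
    """Count, for each pair of actors, the number of titles they both appear in."""
    cast = {}
    for title_id, name, role in credits:
        if role == "ACTOR":  # directors are left out of the co-star graph
            cast.setdefault(title_id, set()).add(name)
    edges = Counter()
    for names in cast.values():
        for a, b in combinations(sorted(names), 2):
            edges[(a, b)] += 1
    return edges

# Invented (title id, name, role) tuples.
credits = [
    ("t1", "Ann", "ACTOR"), ("t1", "Ben", "ACTOR"), ("t1", "Cy", "ACTOR"),
    ("t1", "Dee", "DIRECTOR"),
    ("t2", "Ann", "ACTOR"), ("t2", "Ben", "ACTOR"),
    ("t3", "Cy", "ACTOR"),
]

for pair, weight in costar_edges(credits).most_common():
    print(pair, weight)
```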
If you want to see how I obtained these data, please check my GitHub repository.
All data were collected from JustWatch.
I love movies.
I tend to avoid Marvel-Transformers-standardized products, and prefer a mix of classic Hollywood golden-age and obscure Polish artsy movies. Throw in an occasional Japanese-zombie-slasher-giallo as an alibi. Good movies don't exist without bad movies.
On average I watch 200+ movies each year, with peaks at more than 500 movies. Nine years ago I started to log my movies to avoid watching the same movie twice, and also to assign scores. Over the years, it gave me a couple of insights into my viewing habits, but nothing more than what a tenth-grader would learn at school.
I've recently subscribed to Netflix and it pains me to see the global inefficiency of recommendation systems for people like me, who mostly swear by "La politique des auteurs". It's a term coined by the French New Wave critic and director François Truffaut in Cahiers du cinéma, the magazine co-founded by André Bazin, meaning that the quality of a movie is essentially linked to the director and his capacity to execute his vision with his crew. We could debate whether it depends on the movie production pipeline, but let's not for now. Practically, what it means is that I essentially watch movies from directors who made films I've liked.
I suspect Netflix calibrates its recommendation models on the way the "average Joe" chooses a movie. A few months ago I read a study based on a survey, showing that people choose a movie mostly based on genre (55%), then on leading actors (45%). Director and release date were far behind, at around 10% each. It is not surprising, since most people I know don't care who the director is. Lots of US blockbusters don't even mention it on the movie poster. I am aware that collaborative filtering is based on user proximity, which I believe decreases (or even eliminates) the need to characterize a movie. So here I'm more interested in content-based filtering, which is based on product proximity, for several reasons:
Users' tastes are not easily accessible. They are, after all, Netflix's treasure chest
The movie offer on Netflix is so bad for someone who likes auteur films that it wouldn't help
Modeling a movie's intrinsic qualities is a nice challenge
Enough.
"*The secret of getting ahead is getting started*" (Mark Twain)
Figure: network graph (https://img11.hostingpics.net/pics/117765networkgraph.png)
The primary source is www.themoviedb.org. If you watch obscure artsy Romanian homemade movies you may find only 95% of your movies referenced, but for anyone else it should be in the 98%+ range.
Movie details come from the www.themoviedb.org API: movies/details
Movie crew and casting come from the www.themoviedb.org API: movies/credits
Both can be joined by id
They contain all 350k movies from the end of the 19th century up to August 2017. If you remove short films from IMDb, you get a similar number of movies.
I uploaded the program to retrieve incremental movie details on GitHub: https://github.com/stephanerappeneau/scienceofmovies/tree/master/PycharmProjects/GetAllMovies (you need a developer API key from themoviedb.org, though)
I have tried the various supervised (decision tree) and unsupervised (clustering, NLP) approaches described in the discussions; the source code is on GitHub: https://github.com/stephanerappeneau/scienceofmovies
As a bonus I've uploaded the Wikipedia bio summaries of the top 500 critically acclaimed directors, for some interesting NLTK analysis
Here is an overview of the available sources that I've tried:
• IMDb's free CSV dumps (ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/temporaryaccess/) are badly documented, incomplete, loosely structured and impossible to join/merge. There's an API hosted by Amazon Web Services at 1€ per 100 000 requests. With around 1 million movies it could become expensive, and the features are bare. So I searched for other sources.
• www.themoviedb.org is based on crowdsourcing and has an excellent API, limited to 40 requests every 10 seconds. It is quite generous, well documented, and enough to sweep the 450 000 movies in a few days. For my purpose, data quality is not significantly worse than IMDb's, and since the IMDb key is also included there's always the possibility of completing my dataset later (I actually did it).
• www.Boxofficemojo.com has some interesting budget/revenue figures (which are sorely lacking in both IMDb and TMDb), but it actually tracks only a few thousand movies, mainly blockbusters. There are other professional sources that the film industry uses to get better predictive/marketing insights, but they are beyond my reach for this experiment.
• www.wikipedia.com is an interesting source with no real cap on API calls. However, it requires a bit of web scraping, and for movies or directors the layout and quality vary a lot, so I suspected it would take a lot of work to get insights and gave this source lower priority.
• www.google.com will ban you after a few minutes of web scraping because their job is to scrape data from others, then sell it, duh.
• It's worth mentioning that there are a few dumps of anonymized Netflix user tastes on Kaggle, because Netflix organised a few competitions to improve its recommendation models. https://www.kaggle.com/netflix-inc/netflix-prize-data
• Online databases are largely white Anglo-Saxon centric, meaning that the Bollywood offer (India is the 2nd biggest producer of movies) is mostly absent from datasets. I'm fine with that, as it's not my cup of tea, plus I lack domain knowledge. The sheer number of Indian movies would probably skew my results anyway (I don't want to have too many martial-arts-musicals in my recommendations ;-)). I have, however, tremendous respect for the Indian movie industry, so I'd love to collaborate with an Indian cinephile!
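To put the themoviedb.org limit in perspective, here is a back-of-the-envelope sketch plus a small sliding-window throttle. It assumes two calls per movie (details and credits) and is not the author's retrieval program. `Throttle` and `sweep_hours` are hypothetical names, and no real API call is made.

```python
import time
from collections import deque

class Throttle:
    """Allow at most max_calls within any window of `period` seconds."""

    def __init__(self, max_calls=40, period=10.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls, self.period = max_calls, period
        self.clock, self.sleep = clock, sleep
        self.calls = deque()  # timestamps of calls still inside the window

    def wait(self):
        """Block until one more call is allowed, then record it."""
        while True:
            now = self.clock()
            while self.calls and now - self.calls[0] >= self.period:
                self.calls.popleft()  # drop calls that have left the window
            if len(self.calls) < self.max_calls:
                self.calls.append(now)
                return
            self.sleep(self.period - (now - self.calls[0]))

def sweep_hours(n_movies, calls_per_movie=2, max_calls=40, period=10.0):
    """Minimum hours needed to fetch every movie at the maximum allowed rate."""
    return n_movies * calls_per_movie * period / max_calls / 3600

hours = sweep_hours(450_000)
print(f"{hours} hours, about {hours / 24:.1f} days")
```

In a fetch loop, calling `throttle.wait()` before each request would keep the client under the limit. At two calls per movie the full sweep comes to roughly two and a half days, in line with "a few days".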
Figure: Westerns (https://img11.hostingpics.net/pics/340226westerns.png)
Starting from there, I had multiple problem statements for both supervised and unsupervised machine learning:
Can I program a tailored-recommendation system based on my own criteria ?
What are the characteristics of movies/directors I like the most ?
What is the probability that I will like my next movie ?
Can I find the data ?
One of the objectives of sharing my work here is to find cinephile data scientists who might be interested and, hopefully, contribute or share insights :) Other interesting leads: use the tagline for NLP/clustering/genre guessing, leverage budget/revenue, link with other data sources using the IMDb normalized title, etc.
Figure: Correlation matrix (https://img11.hostingpics.net/pics/977004matrice.png)
I graduated from a French engineering school, majoring in artificial intelligence, but that was 17 years ago, right in the middle of the AI winter. Like a lot of white male rocket scientists, I ended up in one of the leading European investment banks, quickly abandoning IT development to specialize in trading/risk project management and internal politics. My recent appointment in the Data Office made me aware of recent breakthroughs in data science, and I thought that developing a side project would be an excellent occasion to learn something new. Plus it would give me some much-needed credibility, which decision makers too often lack when it comes to data science.
I've worked on some of the features with Cédric Paternotte, a friend of mine who is a professor of philosophy of science at La Sorbonne. Working with someone with a different background seemed a good idea for motivation, creativity and rigor.
Kudos to the www.themoviedb.org and www.wikipedia.com sites, which really have a great attitude towards open data. This is typically NOT the case for modern big-data companies, which mostly keep data to themselves to try to monetize it. Such a huge contrast with the IMDb or Instagram APIs, which generously let you grab your last 3 comments at a miserable rate. Even if 15 years ago this seemed a mandatory path to get services for free, I predict one day governments will need to break this data monopoly.
[Disclaimer : I apologize in advance for my engrish (I'm french ^-^), any bad-code I've written (there are probably hundreds of way to do it better and faster), any pseudo-scientific assumption I've made, I'm slowly getting back in statistics and lack senior guidance, one day I regress a non-stationary time series and the day after I'll discover I shouldn't have, and any incorrect use of machine-learning models]
Figure: powered by themoviedb.org (https://img11.hostingpics.net/pics/898068408x161poweredbyrectanglegreen.png)