18 datasets found

Netflix Recommendation Engine Dataset
kaggle.com
zip
Updated Mar 28, 2024
Cite
Ritik Kumar (2024). Netflix Recommendation Engine Dataset [Dataset]. https://www.kaggle.com/datasets/ritikkumar38/netflix-recommendation-engine-dataset
Available download formats: zip (0 bytes)
Dataset updated
Mar 28, 2024
Authors
Ritik Kumar
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Dataset

This dataset was created by Ritik Kumar

Released under Apache 2.0

Contents
Netflix Prize Data Set
academictorrents.com
bittorrent
Updated Jan 26, 2015
Cite
Netflix (2015). Netflix Prize Data Set [Dataset]. https://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a
Available download formats: bittorrent (697552028)
Dataset updated
Jan 26, 2015
Dataset authored and provided by
Netflix
License
https://academictorrents.com/nolicensespecified
Description
This is the official data set used in the Netflix Prize competition. The data consists of about 100 million movie ratings, and the goal is to predict missing entries in the movie-user rating matrix.

| Attribute | Value |
|---|---|
| Data Set Characteristics | Multivariate, Time-Series |
| Attribute Characteristics | Integer |
| Associated Tasks | Clustering, Recommender-Systems |
| Number of Instances | 100480507 |
| Number of Attributes | 17770 |
| Missing Values? | Yes |
| Area | N/A |

# Data Set Information

This dataset was constructed to support participants in the Netflix Prize. There are over 480,000 customers in the dataset, each identified by a unique integer id. The title and release year for each movie are also provided. There are over 17,000 movies in the dataset, each identified by
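To give a sense of how sparse the movie-user rating matrix is, the figures quoted above can be turned into a rough density estimate. This is only a sketch: the description gives the customer count as "over 480,000", so the constant below is a lower bound and the resulting density is a slight overestimate.

```python
# Figures quoted in the description above. The customer count is only given
# as "over 480,000", so it is treated here as a lower bound (assumption).
NUM_RATINGS = 100_480_507
NUM_MOVIES = 17_770
NUM_CUSTOMERS_LOWER_BOUND = 480_000


def matrix_density(num_ratings, num_rows, num_cols):
    """Fraction of cells in a num_rows x num_cols matrix that hold a rating."""
    return num_ratings / (num_rows * num_cols)


density = matrix_density(NUM_RATINGS, NUM_CUSTOMERS_LOWER_BOUND, NUM_MOVIES)
print(f"filled: {density:.2%}, missing: {1 - density:.2%}")
```

Under these assumptions a little over 1% of the cells are filled, which is why the task is framed as predicting missing entries.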
Netflix Prize Shows Information (9000 Shows)
kaggle.com
Updated Oct 24, 2021
Cite
Akash Guna (2021). Netflix Prize Shows Information (9000 Shows) [Dataset]. https://www.kaggle.com/datasets/akashguna/netflix-prize-shows-information/code
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Oct 24, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Akash Guna
Description
Context

Netflix Prize data is one of the popular datasets available today for OTT recommendation. The Netflix Prize Dataset contains title, userid, rating, and date of rating as the only attributes for recommendation. We extend the Netflix Prize dataset by scraping IMDB data about the titles in the Netflix Prize dataset. Any copyright to the scraped data belongs to its respective owners.

Content

The dataset contains information on approximately 9000 movies and TV shows available in the Netflix Prize dataset. Information such as movie duration, cast and crew, genre, and languages is present. For columns that hold multiple values in a row, arrays have been used to store those values. Please use the .json file to access the dataset to avoid string-related errors.
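The advice to prefer the .json file can be illustrated with a small round trip. The sketch below uses a made-up record (the field names `title` and `genres` are illustrative assumptions, not the dataset's actual schema) to show that an array column comes back from CSV as a string, while JSON preserves it as a list.

```python
import csv
import io
import json

# Hypothetical record; the field names are illustrative, not the real schema.
record = {"title": "Example Show", "genres": ["Drama", "Crime"]}

# CSV round trip: the list is written as its string representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "genres"])
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
csv_row = next(csv.DictReader(buffer))

# JSON round trip: the list survives as a list.
json_row = json.loads(json.dumps(record))

print(type(csv_row["genres"]).__name__, repr(csv_row["genres"]))
print(type(json_row["genres"]).__name__, json_row["genres"])
```

With the CSV version, every array column would need an extra parsing step before use, which is one plausible source of the string-related errors mentioned above.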

Inspiration

Could you build a hybrid recommendation system by combining our dataset with the Netflix Prize Dataset?

Update 1

Some entries in imdb.csv and imdb.json describe movies that share a title with a movie in the Netflix Prize Dataset but were made after 2005 (the release of the Netflix Prize Dataset). This has been corrected in imdb_processed.csv and imdb_processed.json. Please use the processed data for tasks specific to the Netflix Prize Dataset.

Link to Netflix Prize Dataset

https://www.kaggle.com/netflix-inc/netflix-prize-data
Recommendation System Dataset
data.niaid.nih.gov
Updated Feb 23, 2021
Cite
Open Source Dataset (2021). Recommendation System Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4556133
Dataset updated
Feb 23, 2021
Dataset authored and provided by
Open Source Dataset
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
A movie dataset used for a Netflix recommendation system engine
Netflix Movies and TV Shows Dataset
kaggle.com
Updated Sep 27, 2021
Cite
Miraj Shah (2021). Netflix Movies and TV Shows Dataset [Dataset]. https://www.kaggle.com/datasets/mirajshah07/netflix-dataset/versions/2
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Sep 27, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Miraj Shah
Description
Dataset

This dataset was created by Miraj Shah

Contents
Netflix Prize Dataset for CreateML Recommender
kaggle.com
Updated Feb 11, 2025
Cite
Kari Groszewska (2025). Netflix Prize Dataset for CreateML Recommender [Dataset]. https://www.kaggle.com/datasets/karigroszewska/netflix-prize-dataset-for-createml-recommender/suggestions
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Feb 11, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Kari Groszewska
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Check out the project GitHub for even more details.

During GHW: February 2025, I wanted the opportunity to experiment more with the CreateML tools built into Xcode to create a recommendation system. I had previously used CreateML to make a learning/test project, but nothing quite on this scale.

Thanks to others' recommendations and scouring Kaggle, I was introduced to the Netflix Prize Data dataset, which was used for a Netflix-run contest to improve movie recommendation systems. In order to feed this dataset into CreateML, a lot of cleaning and reorganization had to be completed. CreateML requires datasets to look a specific way: they need header names, userIDs, titles, and ratings. It also requires the test and train datasets to be separated outside the tool.

The merge.py script was used alongside the data provided in Netflix Prize Data to better organize this dataset for learning purposes. The script and 2 final data sets were uploaded onto this page.
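The kind of reorganization described above might look roughly like the following. This is a hypothetical sketch, not the author's merge.py: it assumes the raw ratings use a block-per-movie layout (a `movieID:` line followed by `userID,rating,date` lines), and the sample rows, the `TITLES` lookup, and the 75/25 split are all made up for illustration.

```python
import csv
import io
import random

# Assumed raw layout: "movieID:" then "userID,rating,date" lines (made-up rows).
RAW_SAMPLE = """1:
101,3,2005-09-06
102,5,2005-05-13
2:
103,4,2005-09-05
101,3,2005-04-19
"""

# Hypothetical movieID -> title lookup.
TITLES = {1: "Movie A", 2: "Movie B"}


def parse_ratings(lines):
    """Yield (user_id, title, rating) rows from the block-per-movie layout."""
    movie_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            movie_id = int(line[:-1])
        else:
            user_id, rating, _date = line.split(",")
            yield int(user_id), TITLES[movie_id], int(rating)


def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle deterministically and split into (train, test) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def to_csv(rows):
    """Serialize rows with the header names a tabular recommender expects."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["userID", "title", "rating"])
    writer.writerows(rows)
    return out.getvalue()


train, test = split_rows(parse_ratings(RAW_SAMPLE.splitlines()))
print(to_csv(train))
print(to_csv(test))
```

Writing the train and test rows to separate files keeps the split outside the training tool, as the description requires.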

The CreateML recommender will be uploaded once training is completed, alongside a completed prototype of the SwiftUI application which uses the recommender.
Netflix IMDB Dataset
opendatabay.com
Updated Jul 4, 2025
Cite
Datasimple (2025). Netflix IMDB Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/51d17d3d-7817-40a9-a400-149b5da7119c
Dataset updated
Jul 4, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Area covered
Entertainment & Media Consumption
Description
This dataset provides a detailed list and metadata for approximately 7,000 TV shows and movies available on Netflix as of June 2021. Sourced from the IMDB website, it offers insights into content characteristics, popularity, and categorisation, making it suitable for various analytical and machine learning applications.

Columns

imdb_id: A unique identifier for each show or movie.

title: The title of the television programme or film.

popular_rank: The ranking assigned by IMDB based on popularity.

certificate: Age certifications received by the content; it is noted that many values may be null.

startYear: The year the show was first broadcast or the film was released.

endYear: The year a show concluded, if applicable.

episodes: The total number of episodes in a series; for films, this value is 1.

runtime: The running time of the content.

type: Specifies whether the content is a 'Movie' or 'Series'.

orign_country: The country of origin for the show or movie.

language: The primary language of the content.

plot: A synopsis of the show or movie.

summary: A concise summary of the story.

rating: The average user rating for the content.

numVotes: The total number of votes received for the content's rating.

genres: The genre(s) to which the show or movie belongs.

isAdult: A binary indicator (1 for adult content, 0 otherwise).

cast: The main cast members listed in a suitable format.

image_url: A link to the poster image for the content.

Distribution

The dataset is typically provided as a CSV file, specifically named netflix_list.csv. It contains approximately 7,000 records, with 7,008 unique identifiers for shows and movies. This dataset is listed as version 1.0 and was added to the platform on 11 June 2025.

Usage

This dataset is ideally suited for developing recommender systems, performing natural language processing (NLP) tasks on plot summaries, and conducting market analysis of entertainment content. It can be used to explore trends in movie and TV show production, analyse viewer preferences, and facilitate content categorisation efforts.

Coverage

The dataset offers global coverage, with information on content originating from various countries. The startYear of content spans from 1932 to 2022, with the majority of content released between 2004 and 2022. The endYear ranges from 1969 to 2022, with most data concentrated from 2011 to 2022. It includes age certification information and an indicator for adult content, allowing for demographic considerations related to content suitability.

License

CC0

Who Can Use It

This dataset is valuable for data scientists and machine learning engineers working on content recommendation engines or text analysis projects. It is also beneficial for researchers studying media consumption patterns and entertainment industry analysts interested in exploring the Netflix content catalogue programmatically.

Dataset Name Suggestions

Netflix Content Metadata (June 2021)

Global Netflix Catalogue

Netflix IMDB Dataset

Streaming Content Insights (Netflix)

Netflix Movie and TV Show Archive

Attributes

Original Data Source: Netflix Movie and TV Shows (June 2021)
Netflix Movies and TV Shows Dataset
cubig.ai
Updated May 25, 2025
Cite
CUBIG (2025). Netflix Movies and TV Shows Dataset [Dataset]. https://cubig.ai/store/products/261/netflix-movies-and-tv-shows-dataset
Dataset updated
May 25, 2025
Dataset authored and provided by
CUBIG
License
https://cubig.ai/store/terms-of-service
Measurement technique
Synthetic data generation using AI techniques for model training, Privacy-preserving data transformation via differential privacy
Description
1) Data Introduction
• The Netflix Movies and TV Shows Dataset contains various metadata on movies and TV shows available on Netflix.
• Key features include the title, director, cast, country, date added, release year, rating, genre, and total duration (in minutes or number of seasons) of the content.

2) Data Utilization
(1) Characteristics of the Netflix Movies and TV Shows Dataset
• This dataset helps in understanding content trends and markets, as well as analyzing global preferences and changing consumer tastes.
• It is useful for analyzing the characteristics of content available in different countries, including genre, cast, director, and more.

(2) Applications of the Netflix Movies and TV Shows Dataset
• Content Analysis: Analyze how Netflix's content is distributed, and understand preferences based on genre or country.
• Recommendation System Development: Develop algorithms that recommend similar content based on user viewing patterns.
• Market Analysis: Identify which content is popular in different countries and analyze if Netflix focuses more on specific countries or genres.
NetFlix-Prize-Lite
kaggle.com
Updated Jul 13, 2023
Cite
Dhirendra Yadav (2023). NetFlix-Prize-Lite [Dataset]. https://www.kaggle.com/datasets/mlpedia/netflix-prize-lite
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jul 13, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Dhirendra Yadav
Description
Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. This is the dataset that was used in that competition.

Full data: https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data
Netflix Recommendation System
kaggle.com
zip
Updated Feb 24, 2021
Cite
Gaurav Dutta (2021). Netflix Recommendation System [Dataset]. https://www.kaggle.com/gauravduttakiit/netflix-recommendation-system
Available download formats: zip (716193814 bytes)
Dataset updated
Feb 24, 2021
Authors
Gaurav Dutta
Description
Dataset

This dataset was created by Gaurav Dutta

Contents

It contains the following files:
Netflix Movies and TV shows
kaggle.com
Updated Jan 25, 2023
Cite
Sandeep Bansode (2023). Netflix Movies and TV shows [Dataset]. https://www.kaggle.com/datasets/bansodesandeep/netflix-movies-and-tv-shows/discussion?sort=undefined
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jan 25, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sandeep Bansode
Description
Attribute Information
1. show_id: Unique ID for every Movie / Tv Show
2. type: Identifier - A Movie or TV Show
3. title: Title of the Movie / Tv Show
4. director: Director of the Movie
5. cast: Actors involved in the movie / show
6. country: Country where the movie / show was produced
7. date_added: Date it was added on Netflix
8. release_year: Actual Release year of the movie / show
9. rating: TV Rating of the movie / show
10. duration: Total Duration - in minutes or number of seasons
11. listed_in: Genre
12. description: The Summary description
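Because the duration attribute mixes two units, it usually needs to be split before analysis. The sketch below assumes the values look like "90 min" or "2 Seasons"; that string format is an assumption, since the description only says "in minutes or number of seasons", and `parse_duration` is a hypothetical helper.

```python
import re

# Assumed value shapes: "<number> min", "<number> Season" or "<number> Seasons".
_DURATION = re.compile(r"\s*(\d+)\s*(min|seasons?)\s*", flags=re.IGNORECASE)


def parse_duration(value):
    """Return (amount, unit), where unit is 'min' or 'season'."""
    match = _DURATION.fullmatch(value)
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    amount = int(match.group(1))
    unit = "min" if match.group(2).lower() == "min" else "season"
    return amount, unit


print(parse_duration("90 min"))
print(parse_duration("2 Seasons"))
```

Keeping the unit next to the number avoids accidentally comparing a movie's minutes with a show's season count.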
Netflix Prize Data: 5 candidate elections with weak preferences
figshare.com
application/gzip
Updated May 31, 2023
Cite
Christian Stricker (2023). Netflix Prize Data: 5 candidate elections with weak preferences [Dataset]. http://doi.org/10.6084/m9.figshare.3972123.v1
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.3972123.v1
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Christian Stricker
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
The Netflix Prize was a competition devised by Netflix to improve the accuracy of its recommendation system. To facilitate this, Netflix released real ratings about movies from the users (voters) of the system. Any set of movies can be transformed into an election via a process outlined by Mattei, Forshee, and Goldsmith. This data set includes all 5 candidate elections with at least 350 voters generated by this process from 300 randomly chosen movies. Extending beyond prior work by Mattei et al., we allow for weak preferences, i.e., a voter is indifferent between a set of movies if he assigns the same rating to each of them. Thus, there are 541 possibilities to rank a given set of five movies. The archive is gzip compressed and includes 165,672 elections in PrefLib.org's TOC file format (Orders with Ties - Complete List).
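The figure of 541 is the number of ways to rank five items when ties are allowed. A short sketch can confirm it and show how one voter's ratings turn into such a ranking; the function names and the sample ratings are illustrative, and this is not the generation process of Mattei, Forshee, and Goldsmith.

```python
from math import comb


def weak_order_count(n):
    """Number of rankings of n items when ties are allowed."""
    counts = [1]  # exactly one way to rank zero items
    for m in range(1, n + 1):
        # Choose the k items tied for first place, then rank the other m - k.
        counts.append(sum(comb(m, k) * counts[m - k] for k in range(1, m + 1)))
    return counts[n]


def weak_order(ratings):
    """Group movies into tiers, best rating first; equal ratings share a tier."""
    tiers = {}
    for movie, rating in ratings.items():
        tiers.setdefault(rating, []).append(movie)
    return [sorted(tiers[rating]) for rating in sorted(tiers, reverse=True)]


print(weak_order_count(5))  # 541
print(weak_order({"A": 5, "B": 3, "C": 5, "D": 1, "E": 3}))
# [['A', 'C'], ['B', 'E'], ['D']]
```

With strict preferences only, the count would be 5! = 120, so allowing indifference more than quadruples the number of possible ballots.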
Netflix Prize Data
kaggle.com
zip
Updated Nov 3, 2021
Cite
Elemento (2021). Netflix Prize Data [Dataset]. https://www.kaggle.com/elemento/netflix-prize-data
Available download formats: zip (3152166694 bytes)
Dataset updated
Nov 3, 2021
Authors
Elemento
Description
Dataset

This dataset was created by Elemento

Contents
Data from: A NOVEL LATENT FACTOR MODEL FOR RECOMMENDER SYSTEM
scielo.figshare.com
jpeg
Updated Jun 1, 2023
Cite
Bipul Kumar (2023). A NOVEL LATENT FACTOR MODEL FOR RECOMMENDER SYSTEM [Dataset]. http://doi.org/10.6084/m9.figshare.20011768.v1
Available download formats: jpeg
Unique identifier
https://doi.org/10.6084/m9.figshare.20011768.v1
Dataset updated
Jun 1, 2023
Dataset provided by
SciELO journals
Authors
Bipul Kumar
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
ABSTRACT Matrix factorization (MF) has evolved as one of the better practices for handling sparse data in the field of recommender systems. Funk singular value decomposition (SVD) is a variant of MF and a state-of-the-art method that enabled winning the Netflix prize competition. The method is widely used with modifications in present-day research in the field of recommender systems. With the potential of data points to grow at very high velocity, it is prudent to devise newer methods that can handle such data accurately and more efficiently than Funk-SVD in the context of recommender systems. In view of the growing data points, I propose a latent factor model that caters to both accuracy and efficiency by reducing the number of latent features of either users or items, making it less complex than Funk-SVD, where latent features of both users and items are equal and often larger. A comprehensive empirical evaluation of accuracy on two publicly available datasets, amazon and ml-100k, reveals the comparable accuracy and lower complexity of the proposed methods relative to Funk-SVD.
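For orientation, here is a minimal sketch of the Funk-SVD baseline the abstract compares against, not the proposed model (whose details are not given here). It learns one factor vector per user and per item by stochastic gradient descent, with the same number of latent features on both sides. The toy ratings and all hyperparameters are assumptions chosen for illustration.

```python
import random
from math import sqrt


def predict(P, Q, u, i):
    """Predicted rating: dot product of user and item factor vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def rmse(ratings, P, Q):
    """Root-mean-square error over the observed (user, item, rating) triples."""
    return sqrt(sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings) / len(ratings))


def funk_svd(ratings, n_users, n_items, n_factors=2, lr=0.01, reg=0.02,
             epochs=500, seed=0):
    """Fit user factors P and item factors Q with regularized SGD."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


# Made-up (user, item, rating) triples; only observed cells are listed.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
           (2, 1, 2), (2, 2, 5), (3, 0, 5), (3, 2, 2)]

P0, Q0 = funk_svd(ratings, n_users=4, n_items=3, epochs=0)  # untrained factors
P, Q = funk_svd(ratings, n_users=4, n_items=3)
print("training RMSE before:", rmse(ratings, P0, Q0))
print("training RMSE after: ", rmse(ratings, P, Q))
```

Only observed ratings enter the updates, which is what lets this family of methods cope with sparse matrices.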
Netflix Movie Ratings
kaggle.com
Updated Dec 9, 2024
Cite
Luis Heitor Ribeiro (2024). Netflix Movie Ratings [Dataset]. https://www.kaggle.com/datasets/luisheitorribeiro/netflix-movie-ratings/discussion?sort=undefined
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Dec 9, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Luis Heitor Ribeiro
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
This is a reduced dataset drawn from Netflix's much larger movie ratings database, for use in collaborative filtering, recommendation systems, and related applications.

Any particular user has rated only a fraction of the movies, so the data matrix is only partially filled. The goal here is to fill all the remaining entries of the matrix, and then compare with the complete test matrix.
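The fill-then-compare workflow described above can be sketched with the simplest possible baseline: fill each missing entry with that user's mean rating, then score only the filled cells against the complete test matrix. The two small matrices below are made up, and the helper names are hypothetical.

```python
from math import sqrt

# Made-up matrices: rows are users, columns are movies, None marks a gap.
train = [[5, None, 3],
         [4, 2, None],
         [None, 1, 4]]
test = [[5, 1, 3],
        [4, 2, 5],
        [3, 1, 4]]


def fill_with_user_mean(matrix):
    """Replace each None with the mean of that user's known ratings.

    Assumes every user has rated at least one movie.
    """
    filled = []
    for row in matrix:
        known = [value for value in row if value is not None]
        mean = sum(known) / len(known)
        filled.append([mean if value is None else value for value in row])
    return filled


def rmse_on_missing(train, filled, test):
    """RMSE restricted to the cells that were missing in the training matrix."""
    errors = [(filled[u][i] - test[u][i]) ** 2
              for u, row in enumerate(train)
              for i, value in enumerate(row) if value is None]
    return sqrt(sum(errors) / len(errors))


filled = fill_with_user_mean(train)
score = rmse_on_missing(train, filled, test)
print(filled)
print(score)
```

Any collaborative-filtering model can be dropped in place of `fill_with_user_mean` and compared using the same score.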
Amazon Prime TV Shows and Movies
kaggle.com
Updated May 14, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Victor Soeiro (2022). Amazon Prime TV Shows and Movies [Dataset]. https://www.kaggle.com/datasets/victorsoeiro/amazon-prime-tv-shows-and-movies/data
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 14, 2022
Dataset provided by
Kaggle
Authors
Victor Soeiro
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Amazon Prime - Movies and TV Dramas

This data set was created to list all shows available on Amazon Prime streaming, and analyze the data to find interesting facts. The data was acquired in May 2022 and covers titles available in the United States.

Content

This dataset has two files containing the titles (titles.csv) and the cast (credits.csv) for the title.

This dataset contains over 9k unique titles on Amazon Prime with 15 columns containing their information, including:

id: The title ID on JustWatch.

title: The name of the title.

show_type: TV show or movie.

description: A brief description.

release_year: The release year.

age_certification: The age certification.

runtime: The length of the episode (SHOW) or movie.

genres: A list of genres.

production_countries: A list of countries that produced the title.

seasons: Number of seasons if it's a SHOW.

imdb_id: The title ID on IMDB.

imdb_score: Score on IMDB.

imdb_votes: Votes on IMDB.

tmdb_popularity: Popularity on TMDB.

tmdb_score: Score on TMDB.

And over 124k credits of actors and directors on Amazon Prime titles with 5 columns containing their information:

person_ID: The person ID on JustWatch.

id: The title ID on JustWatch.

name: The actor or director's name.

character_name: The character name.

role: ACTOR or DIRECTOR.

Tasks

Developing a content-based recommender system using the genres and/or descriptions.

Identifying the main content available on the streaming service.

Network analysis on the cast of the titles.

Exploratory data analysis to find interesting insights.
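As one possible starting point for the content-based recommender task, the sketch below ranks titles by the Jaccard similarity of their genre sets. It assumes the genres column has already been parsed into Python lists; the four sample titles and the `most_similar` helper are made up for illustration.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up titles keyed by id; genres are assumed to be parsed into lists.
titles = {
    "t1": {"title": "Title One", "genres": ["drama", "crime"]},
    "t2": {"title": "Title Two", "genres": ["drama", "crime", "thriller"]},
    "t3": {"title": "Title Three", "genres": ["comedy"]},
    "t4": {"title": "Title Four", "genres": ["crime"]},
}


def most_similar(title_id, titles, top_n=2):
    """Return the top_n (other_id, score) pairs ranked by genre overlap."""
    target = titles[title_id]["genres"]
    scored = [(jaccard(target, info["genres"]), other)
              for other, info in titles.items() if other != title_id]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(other, score) for score, other in scored[:top_n]]


print(most_similar("t1", titles))
```

Descriptions could be folded in later by replacing the genre sets with sets of words or with text embeddings.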

Other Streaming Datasets

HBO Max TV Shows and Movies

Netflix TV Shows and Movies

Disney+ TV Shows and Movies

Hulu TV Shows and Movies

Paramount TV Shows and Movies

Rakuten Viki TV Dramas and Movies

Crunchyroll Animes and Movies

Dark Matter TV Shows and Movies

How to obtain the data

If you want to see how I obtained these data, please check my GitHub repository.

Acknowledgements

All data were collected from JustWatch.
Disney+ TV Shows and Movies
kaggle.com
Updated May 13, 2022
Cite
Victor Soeiro (2022). Disney+ TV Shows and Movies [Dataset]. https://www.kaggle.com/victorsoeiro/disney-tv-shows-and-movies/discussion
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 13, 2022
Dataset provided by
Kaggle
Authors
Victor Soeiro
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Disney+ - TV Shows and Movies

This data set was created to list all shows available on Disney+ streaming, and analyze the data to find interesting facts. The data was acquired in May 2022 and covers titles available in the United States.

Content

This dataset has two files containing the titles (titles.csv) and the cast (credits.csv) for the title.

This dataset contains over 1500 unique titles on Disney+ with 15 columns containing their information, including:

id: The title ID on JustWatch.

title: The name of the title.

show_type: TV show or movie.

description: A brief description.

release_year: The release year.

age_certification: The age certification.

runtime: The length of the episode (SHOW) or movie.

genres: A list of genres.

production_countries: A list of countries that produced the title.

seasons: Number of seasons if it's a SHOW.

imdb_id: The title ID on IMDB.

imdb_score: Score on IMDB.

imdb_votes: Votes on IMDB.

tmdb_popularity: Popularity on TMDB.

tmdb_score: Score on TMDB.

And over 26k credits of actors and directors on Disney+ titles with 5 columns containing their information, including:

person_ID: The person ID on JustWatch.

id: The title ID on JustWatch.

name: The actor or director's name.

character_name: The character name.

role: ACTOR or DIRECTOR.

Tasks

Developing a content-based recommender system using the genres and/or descriptions.

Identifying the main content available on the streaming service.

Network analysis on the cast of the titles.

Exploratory data analysis to find interesting insights.
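For the cast network analysis task, a co-appearance graph is a common first step: two people are linked whenever they are credited on the same title, and the edge weight counts the titles they share. The sketch below uses the credits column names listed above (id, name, role) on made-up rows; `co_appearance_edges` is a hypothetical helper.

```python
from collections import defaultdict
from itertools import combinations

# Made-up rows following the credits columns described above.
credits = [
    {"person_ID": 1, "id": "t1", "name": "Person A", "role": "ACTOR"},
    {"person_ID": 2, "id": "t1", "name": "Person B", "role": "ACTOR"},
    {"person_ID": 3, "id": "t1", "name": "Person C", "role": "DIRECTOR"},
    {"person_ID": 1, "id": "t2", "name": "Person A", "role": "ACTOR"},
    {"person_ID": 2, "id": "t2", "name": "Person B", "role": "DIRECTOR"},
]


def co_appearance_edges(credits):
    """Map each unordered pair of people to the number of titles they share."""
    people_by_title = defaultdict(set)
    for row in credits:
        people_by_title[row["id"]].add(row["name"])
    edges = defaultdict(int)
    for people in people_by_title.values():
        for a, b in combinations(sorted(people), 2):
            edges[(a, b)] += 1
    return dict(edges)


print(co_appearance_edges(credits))
```

The resulting edge dictionary can be fed to any graph library to compute centrality or communities.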

Other Streaming Datasets

HBO Max TV Shows and Movies

Amazon Prime TV Shows and Movies

Netflix TV Shows and Movies

Hulu TV Shows and Movies

Paramount TV Shows and Movies

Rakuten Viki TV Dramas and Movies

Crunchyroll Animes and Movies

Dark Matter TV Shows and Movies

How to obtain the data

If you want to see how I obtained these data, please check my GitHub repository.

Acknowledgements

All data were collected from JustWatch.
350 000+ movies from themoviedb.org
kaggle.com
zip
Updated Oct 12, 2017
Cite
Stephanerappeneau (2017). 350 000+ movies from themoviedb.org [Dataset]. https://www.kaggle.com/stephanerappeneau/350-000-movies-from-themoviedborg
Available download formats: zip (70483259 bytes)
Dataset updated
Oct 12, 2017
Authors
Stephanerappeneau
Description
Context

I love movies.

I tend to avoid marvel-transformers-standardized products, and prefer a mix of classic hollywood-golden-age and obscure polish artsy movies. Throw in an occasional japanese-zombie-slasher-giallo as an alibi. Good movies don't exist without bad movies.

On average I watch 200+ movies each year, with peaks at more than 500 movies. Nine years ago I started to log my movies to avoid watching the same movie twice, and also assign scores. Over the years, it gave me a couple insights on my viewing habits but nothing more than what a tenth-grader would learn at school.

I've recently subscribed to Netflix and it pains me to see the global inefficiency of recommendation systems for people like me, who mostly swear by "La politique des auteurs". It's a term coined by famous new-wave French movie critic André Bazin, meaning that the quality of a movie is essentially linked to the director and the director's capacity to execute his vision with his crew. We could debate whether it depends on the movie production pipeline, but let's not for now. Practically, what it means is that I essentially watch movies from directors who made films I've liked.

I suspect Netflix calibrates its recommendation models taking into account the way the "average joe" chooses a movie. A few months ago I read a study based on a survey, showing that people chose a movie mostly based on genre (55%), then by leading actors (45%). Director and release date were far behind, at around 10% each. It is not surprising, since most people I know don't care who the director is. Lots of US blockbusters don't even mention it on the movie poster. I am aware that collaborative filtering is based on user proximity, which I believe decreases (or even eliminates) the need to characterize a movie. So here I'm more interested in content-based filtering, which is based on product proximity, for several reasons:

Users' tastes are not easily accessible. They are, after all, Netflix's treasure chest

The movie offering on Netflix is so bad for someone who likes auteur movies that it wouldn't help

Modeling a movie's intrinsic qualities is a nice challenge

Enough.

"*The secret of getting ahead is getting started*" (Mark Twain)

[Image: network graph]

Content

The primary source is www.themoviedb.org. If you watch obscure artsy Romanian homemade movies you may find only 95% of your movies referenced...but for anyone else it should be in the 98%+ range.

Movie details are from the www.themoviedb.org API: movies/details

Movie crew and casting are from the www.themoviedb.org API: movies/credits

Both can be joined by id

They contain all 350k movies, from the end of the 19th century up to August 2017. If you remove short movies from imdb you get a similar number of movies.

I uploaded the program to retrieve incremental movie details on github: https://github.com/stephanerappeneau/scienceofmovies/tree/master/PycharmProjects/GetAllMovies (you need a dev API key from themoviedb.org though)

I have tried various supervised (decision tree) / unsupervised (clustering, NLP) approaches described in the discussions; the source code is on github: https://github.com/stephanerappeneau/scienceofmovies

As a bonus I've uploaded the bio summaries of the top 500 critically acclaimed directors from wikipedia, for some interesting NLTK analysis

Here is an overview of the available sources that I've tried:

• Imdb.com free csv dumps (ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/temporaryaccess/) are badly documented, incomplete, loosely structured and impossible to join/merge. There's an API hosted by Amazon Web Service: 1€ every 100 000 requests. With around 1 million movies, it could become expensive, and the features are bare. So I've searched for other sources.

• www.themoviedb.org is based on crowdsourcing and has an excellent API, limited to 40 requests every 10 seconds. It is quite generous, well documented, and enough to sweep the 450 000 movies in a few days. For my purpose, data quality is not significantly worse than imdb, and as the imdb key is also included there's always the possibility to complete my dataset later (I actually did it)

• www.Boxofficemojo.com has some interesting budget/revenue figures (which are sorely lacking in both imdb & tmdb), but it actually tracks only a few thousand movies, mainly blockbusters. There are other professional sources that are used by the film industry to get better predictive / marketing insights but that's beyond my reach for this experiment.

• www.wikipedia.com is an interesting source with no real cap on API calls; however, it requires a bit of web scraping, and for movies or directors the layout and quality vary a lot, so I suspected it'd take a lot of work to get insights and I put this source in lower priority.

• www.google.com will ban you after a few minutes of web scraping because their job is to scrape data from others, then sell it, duh.

• It's worth mentioning that there are a few dumps of Netflix anonymized user tastes on kaggle, because they've organised a few competitions to improve their recommendation models. https://www.kaggle.com/netflix-inc/netflix-prize-data

• Online databases are largely white anglo-saxon centric, meaning the Bollywood offer (India is the 2nd biggest producer of movies) is mostly absent from datasets. I'm fine with that, as it's not my cup of tea, plus I lack domain knowledge. The sheer amount of Indian movies would probably skew my results anyway (I don't want to have too many martial-arts-musicals in my recommendations ;-)). I have, however, tremendous respect for the Indian movie industry so I'd love to collaborate with an Indian cinephile!

[Image: Westerns]
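The "few days" estimate for sweeping themoviedb.org can be checked with quick arithmetic from the limit quoted in the list above (40 requests every 10 seconds) and the 450 000 movies. The sketch assumes two calls per movie, one for details and one for credits, and ignores retries and network latency.

```python
REQUESTS_PER_WINDOW = 40
WINDOW_SECONDS = 10
MOVIES = 450_000
CALLS_PER_MOVIE = 2  # assumption: one details call plus one credits call


def sweep_days(movies, calls_per_movie, requests_per_window, window_seconds):
    """Days needed to issue every request at the maximum allowed rate."""
    total_requests = movies * calls_per_movie
    requests_per_second = requests_per_window / window_seconds
    return total_requests / requests_per_second / 86_400


days = sweep_days(MOVIES, CALLS_PER_MOVIE, REQUESTS_PER_WINDOW, WINDOW_SECONDS)
print(round(days, 1))  # 2.6
```

At 4 requests per second, 900 000 requests take 225 000 seconds, or about 2.6 days of continuous running.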

Inspiration

Starting from there, I had multiple problem statements for both supervised / unsupervised machine learning

Can I program a tailored-recommendation system based on my own criteria ?

What are the characteristics of movies/directors I like the most ?

What is the probability that I will like my next movie ?

Can I find the data ?

One of the objectives of sharing my work here is to find cinephile data-scientists who might be interested and, hopefully, contribute or share insights :) Other interesting leads : use tagline for NLP/Clustering/Genre guessing, leverage on budget/revenue, link with other data sources using the imdb normalized title, etc.

[Image: Correlation matrix]

Motivation, Disclaimer and Acknowledgements

I graduated from a French engineering school, majoring in artificial intelligence, but that was 17 years ago, right in the middle of the A.I. winter. Like a lot of white male rocket scientists, I ended up in one of the leading European investment banks, quickly abandoning IT development to specialize in trading/risk project management and internal politics. My recent appointment in the Data Office made me aware of recent breakthroughs in data science, and I thought that developing a side project would be an excellent occasion to learn something new. Plus it'd give me some much-needed credibility, which decision makers too often lack when it comes to data science.

I've worked on some of the features with Cédric Paternotte, a friend of mine who is a professor of philosophy of sciences in La Sorbonne. Working with someone with a different background seemed a good idea for motivation, creativity and rigor.

Kudos to the www.themoviedb.org and www.wikipedia.com sites, which really have a great attitude towards open data. This is typically NOT the case for modern big-data companies, which mostly keep data to themselves to try to monetize it. Such a huge contrast with the imdb or instagram APIs, which generously let you grab your last 3 comments at a miserable rate. Even if 15 years ago this seemed a mandatory path to get services for free, I predict one day governments will need to break this data monopoly.

[Disclaimer: I apologize in advance for my engrish (I'm French ^-^), any bad code I've written (there are probably hundreds of ways to do it better and faster), any pseudo-scientific assumption I've made (I'm slowly getting back into statistics and lack senior guidance; one day I regress a non-stationary time series and the day after I discover I shouldn't have), and any incorrect use of machine-learning models]

[Image: powered by themoviedb.org]
Netflix Recommendation Engine Dataset: 100 scholarly articles cite this dataset (View in Google Scholar)
