18 datasets found
  1. Netflix Recommendation Engine Dataset

    • kaggle.com
    zip
    Updated Mar 28, 2024
    Cite
    Ritik Kumar (2024). Netflix Recommendation Engine Dataset [Dataset]. https://www.kaggle.com/datasets/ritikkumar38/netflix-recommendation-engine-dataset
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 28, 2024
    Authors
    Ritik Kumar
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Ritik Kumar

    Released under Apache 2.0

    Contents

  2. Netflix Prize Data Set

    • academictorrents.com
    bittorrent
    Updated Jan 26, 2015
    Cite
    Netflix (2015). Netflix Prize Data Set [Dataset]. https://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a
    Available download formats: bittorrent (697552028)
    Dataset updated
    Jan 26, 2015
    Dataset authored and provided by
    Netflix
    License

    https://academictorrents.com/nolicensespecified

    Description

    This is the official data set used in the Netflix Prize competition. The data consists of about 100 million movie ratings, and the goal is to predict missing entries in the movie-user rating matrix.

    | Attribute | Value |
    |---|---|
    | Data Set Characteristics | Multivariate, Time-Series |
    | Attribute Characteristics | Integer |
    | Associated Tasks | Clustering, Recommender-Systems |
    | Number of Instances | 100480507 |
    | Number of Attributes | 17770 |
    | Missing Values? | Yes |
    | Area | N/A |

    Data Set Information: This dataset was constructed to support participants in the Netflix Prize. There are over 480,000 customers in the dataset, each identified by a unique integer id. The title and release year for each movie are also provided. There are over 17,000 movies in the dataset, each identified by
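
To give a sense of how sparse the movie-user rating matrix is, the figures quoted in this description can be plugged into a short calculation. This is only a rough sketch: it assumes the 17770 "attributes" are the movies, and it uses 480,000 as the customer count because the text says only "over 480,000", so the resulting density is an upper bound.

```python
# Figures quoted in the description above.
NUM_RATINGS = 100_480_507   # "Number of Instances"
NUM_MOVIES = 17_770         # "Number of Attributes" (assumed to be the movie count)
# Assumption: the text only says "over 480,000 customers", so this is a lower bound.
NUM_CUSTOMERS = 480_000


def matrix_density(num_ratings: int, num_rows: int, num_cols: int) -> float:
    """Fraction of cells in a rows x cols matrix that hold an observed rating."""
    return num_ratings / (num_rows * num_cols)


density = matrix_density(NUM_RATINGS, NUM_CUSTOMERS, NUM_MOVIES)
print(f"observed cells: {density:.2%}, missing cells: {1 - density:.2%}")
```

Under these assumptions only a little over 1% of the cells are filled, which is why the task is framed as predicting missing entries.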

  3. Netflix Prize Shows Information (9000 Shows)

    • kaggle.com
    Updated Oct 24, 2021
    Cite
    Akash Guna (2021). Netflix Prize Shows Information (9000 Shows) [Dataset]. https://www.kaggle.com/datasets/akashguna/netflix-prize-shows-information/code
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 24, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Akash Guna
    Description

    Context

    The Netflix Prize data is one of the most popular datasets available today for OTT recommendation. The Netflix Prize Dataset contains title, user id, rating, and date of rating as its only attributes for recommendation. We extend the Netflix Prize Dataset by scraping IMDB data about the titles in it. Any copyright to the scraped data belongs to its respective owners.

    Content

    The dataset contains information on approximately 9000 movies and TV shows from the Netflix Prize Dataset, such as duration, cast and crew, genre, and languages. Columns that hold multiple values per row store them as arrays. Please use the .json file to access the dataset to avoid string-related errors.

    Inspiration

    Could you build a hybrid recommendation system by combining our dataset with the Netflix Prize Dataset?

    Update 1

    Some entries in imdb.csv and imdb.json describe movies that share a title with a movie in the Netflix Prize Dataset but were made after 2005 (the release of the Netflix Prize Dataset). This has been corrected in imdb_processed.csv and imdb_processed.json. Please use the processed files for tasks specific to the Netflix Prize Dataset.
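
The advice to prefer the .json file over the .csv file can be illustrated with a small round trip. The sketch below uses a made-up record (the field names are hypothetical, not taken from the dataset) and shows that a list-valued field survives JSON intact but comes back from CSV as a plain string.

```python
import csv
import io
import json

# Hypothetical record; the field names are illustrative, not taken from the dataset.
record = {"title": "Example Movie", "genres": ["Drama", "Crime"]}

# JSON round trip: the list survives as a list.
from_json = json.loads(json.dumps(record))

# CSV round trip: every cell is text, so the list comes back as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "genres"])
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

print(type(from_json["genres"]).__name__)  # list
print(type(from_csv["genres"]).__name__)   # str
```

Reading array columns from the CSV would require parsing those strings back into lists, which is presumably the kind of string-related error the description warns about.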

    Link to Netflix Prize Dataset

    https://www.kaggle.com/netflix-inc/netflix-prize-data

  4. Recommendation System Dataset

    • data.niaid.nih.gov
    Updated Feb 23, 2021
    Cite
    Open Source Dataset (2021). Recommendation System Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4556133
    Dataset updated
    Feb 23, 2021
    Dataset authored and provided by
    Open Source Dataset
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A movie dataset used for a Netflix recommendation system engine

  5. Netflix Movies and TV Shows Dataset

    • kaggle.com
    Updated Sep 27, 2021
    Cite
    Miraj Shah (2021). Netflix Movies and TV Shows Dataset [Dataset]. https://www.kaggle.com/datasets/mirajshah07/netflix-dataset/versions/2
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 27, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Miraj Shah
    Description

    Dataset

    This dataset was created by Miraj Shah

    Contents

  6. Netflix Prize Dataset for CreateML Recommender

    • kaggle.com
    Updated Feb 11, 2025
    Cite
    Kari Groszewska (2025). Netflix Prize Dataset for CreateML Recommender [Dataset]. https://www.kaggle.com/datasets/karigroszewska/netflix-prize-dataset-for-createml-recommender/suggestions
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 11, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kari Groszewska
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Check out the project GitHub for even more details.

    During GHW: February 2025, I wanted the opportunity to experiment more with the CreateML tools built into Xcode to create a recommendation system. I had previously used CreateML to make a learning/test project, but nothing quite on this scale.

    Thanks to others' recommendations and scouring Kaggle, I was introduced to the Netflix Prize Data dataset, which was used for a Netflix-run contest to improve movie recommendation systems. In order to feed this dataset into CreateML, a lot of cleaning and reorganization had to be completed. CreateML requires datasets to look a specific way: they need header names, userIDs, titles, and ratings. It also requires the training and test datasets to be split outside of CreateML.

    The merge.py script was used alongside the data provided in Netflix Prize Data to better organize this dataset for learning purposes. The script and 2 final data sets were uploaded onto this page.
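
The kind of reshaping described above might look roughly like the sketch below. It is not the author's merge.py: the raw layout (a "movie_id:" line followed by "customer_id,rating,date" lines), the sample rows, the title lookup, and the 80/20 split are all assumptions made for illustration. Only the column names follow the description.

```python
import csv
import io
import random

# Assumed raw layout: a "movie_id:" line followed by "customer_id,rating,date" lines.
# The rows are sample values for illustration only.
RAW = """1:
1488844,3,2005-09-06
822109,5,2005-05-13
2:
2059652,4,2005-09-05
1666394,3,2005-04-19
"""
# Hypothetical id-to-title lookup, used only for this illustration.
TITLES = {1: "Movie A", 2: "Movie B"}


def parse_ratings(text, titles):
    """Yield (user_id, title, rating) rows from the grouped raw format."""
    current_movie = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            current_movie = int(line[:-1])
        else:
            user_id, rating, _date = line.split(",")
            yield int(user_id), titles[current_movie], int(rating)


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and split them into (train, test) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def to_csv(rows):
    """Serialize rows with a header line, as a tabular importer would expect."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["userID", "title", "rating"])
    writer.writerows(rows)
    return out.getvalue()


train, test = split_rows(parse_ratings(RAW, TITLES))
print(to_csv(train))
```

Writing `to_csv(train)` and `to_csv(test)` to two separate files would give the pre-split, headed tables that the description says CreateML expects.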

    The CreateML recommender will be uploaded once training is completed, alongside a completed prototype of the SwiftUI application which uses the recommender.

  7. Netflix IMDB Dataset

    • opendatabay.com
    Updated Jul 4, 2025
    Cite
    Datasimple (2025). Netflix IMDB Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/51d17d3d-7817-40a9-a400-149b5da7119c
    Dataset updated
    Jul 4, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Entertainment & Media Consumption
    Description

    This dataset provides a detailed list and metadata for approximately 7,000 TV shows and movies available on Netflix as of June 2021. Sourced from the IMDB website, it offers insights into content characteristics, popularity, and categorisation, making it suitable for various analytical and machine learning applications.

    Columns

    • imdb_id: A unique identifier for each show or movie.
    • title: The title of the television programme or film.
    • popular_rank: The ranking assigned by IMDB based on popularity.
    • certificate: Age certifications received by the content; it is noted that many values may be null.
    • startYear: The year the show was first broadcast or the film was released.
    • endYear: The year a show concluded, if applicable.
    • episodes: The total number of episodes in a series; for films, this value is 1.
    • runtime: The running time of the content.
    • type: Specifies whether the content is a 'Movie' or 'Series'.
    • orign_country: The country of origin for the show or movie.
    • language: The primary language of the content.
    • plot: A synopsis of the show or movie.
    • summary: A concise summary of the story.
    • rating: The average user rating for the content.
    • numVotes: The total number of votes received for the content's rating.
    • genres: The genre(s) to which the show or movie belongs.
    • isAdult: A binary indicator (1 for adult content, 0 otherwise).
    • cast: The main cast members listed in a suitable format.
    • image_url: A link to the poster image for the content.

    Distribution

    The dataset is typically provided as a CSV file, specifically named netflix_list.csv. It contains approximately 7,000 records, with 7,008 unique identifiers for shows and movies. This dataset is listed as version 1.0 and was added to the platform on 11 June 2025.

    Usage

    This dataset is ideally suited for developing recommender systems, performing natural language processing (NLP) tasks on plot summaries, and conducting market analysis of entertainment content. It can be used to explore trends in movie and TV show production, analyse viewer preferences, and facilitate content categorisation efforts.

    Coverage

    The dataset offers global coverage, with information on content originating from various countries. The startYear of content spans from 1932 to 2022, with the majority of content released between 2004 and 2022. The endYear ranges from 1969 to 2022, with most data concentrated from 2011 to 2022. It includes age certification information and an indicator for adult content, allowing for demographic considerations related to content suitability.

    License

    CC0

    Who Can Use It

    This dataset is valuable for data scientists and machine learning engineers working on content recommendation engines or text analysis projects. It is also beneficial for researchers studying media consumption patterns and entertainment industry analysts interested in exploring the Netflix content catalogue programmatically.

    Dataset Name Suggestions

    • Netflix Content Metadata (June 2021)
    • Global Netflix Catalogue
    • Netflix IMDB Dataset
    • Streaming Content Insights (Netflix)
    • Netflix Movie and TV Show Archive

    Attributes

    Original Data Source: Netflix Movie and TV Shows (June 2021)

  8. Netflix Movies and TV Shows Dataset

    • cubig.ai
    Updated May 25, 2025
    Cite
    CUBIG (2025). Netflix Movies and TV Shows Dataset [Dataset]. https://cubig.ai/store/products/261/netflix-movies-and-tv-shows-dataset
    Dataset updated
    May 25, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Synthetic data generation using AI techniques for model training, Privacy-preserving data transformation via differential privacy
    Description

    1) Data Introduction
    • The Netflix Movies and TV Shows Dataset contains various metadata on movies and TV shows available on Netflix.
    • Key features include the title, director, cast, country, date added, release year, rating, genre, and total duration (in minutes or number of seasons) of the content.

    2) Data Utilization
    (1) Characteristics of the Netflix Movies and TV Shows Dataset
    • This dataset helps in understanding content trends and markets, as well as analyzing global preferences and changing consumer tastes.
    • It is useful for analyzing the characteristics of content available in different countries, including genre, cast, director, and more.

    (2) Applications of the Netflix Movies and TV Shows Dataset
    • Content Analysis: Analyze how Netflix's content is distributed, and understand preferences based on genre or country.
    • Recommendation System Development: Develop algorithms that recommend similar content based on user viewing patterns.
    • Market Analysis: Identify which content is popular in different countries and analyze if Netflix focuses more on specific countries or genres.

  9. NetFlix-Prize-Lite

    • kaggle.com
    Updated Jul 13, 2023
    Cite
    Dhirendra Yadav (2023). NetFlix-Prize-Lite [Dataset]. https://www.kaggle.com/datasets/mlpedia/netflix-prize-lite
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dhirendra Yadav
    Description

    Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. This is the dataset that was used in that competition.

    Full data: https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data

  10. Netflix Recommendation System

    • kaggle.com
    zip
    Updated Feb 24, 2021
    Cite
    Gaurav Dutta (2021). Netflix Recommendation System [Dataset]. https://www.kaggle.com/gauravduttakiit/netflix-recommendation-system
    Available download formats: zip (716193814 bytes)
    Dataset updated
    Feb 24, 2021
    Authors
    Gaurav Dutta
    Description

    Dataset

    This dataset was created by Gaurav Dutta

    Contents

    It contains the following files:

  11. Netflix Movies and TV shows

    • kaggle.com
    Updated Jan 25, 2023
    Cite
    Sandeep Bansode (2023). Netflix Movies and TV shows [Dataset]. https://www.kaggle.com/datasets/bansodesandeep/netflix-movies-and-tv-shows/discussion?sort=undefined
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 25, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sandeep Bansode
    Description

    Attribute Information
    1. show_id : Unique ID for every Movie / TV Show
    2. type : Identifier - A Movie or TV Show
    3. title : Title of the Movie / TV Show
    4. director : Director of the Movie
    5. cast : Actors involved in the movie / show
    6. country : Country where the movie / show was produced
    7. date_added : Date it was added on Netflix
    8. release_year : Actual release year of the movie / show
    9. rating : TV Rating of the movie / show
    10. duration : Total duration, in minutes or number of seasons
    11. listed_in : Genre
    12. description : The summary description
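
Because the duration attribute mixes two units (minutes for movies, seasons for shows), it usually has to be parsed before numeric analysis. The sketch below is one possible approach; the string shapes "90 min" and "2 Seasons" are an assumption about how the column is written, not something the description states.

```python
import re

# Assumption: duration strings look like "90 min" or "2 Seasons".
_DURATION = re.compile(r"^\s*(\d+)\s*(min|seasons?)\s*$", re.IGNORECASE)


def parse_duration(text):
    """Return (value, unit) with unit "min" or "season", or None if unparseable."""
    match = _DURATION.match(text or "")
    if match is None:
        return None
    value, unit = match.groups()
    unit = "min" if unit.lower() == "min" else "season"
    return int(value), unit


for sample in ["90 min", "2 Seasons", "1 Season", ""]:
    print(repr(sample), "->", parse_duration(sample))
```

Returning the unit alongside the value keeps movies and shows from being compared on the same numeric scale by accident.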

  12. Netflix Prize Data: 5 candidate elections with weak preferences

    • figshare.com
    application/gzip
    Updated May 31, 2023
    Cite
    Christian Stricker (2023). Netflix Prize Data: 5 candidate elections with weak preferences [Dataset]. http://doi.org/10.6084/m9.figshare.3972123.v1
    Available download formats: application/gzip
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Christian Stricker
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    The Netflix Prize was a competition devised by Netflix to improve the accuracy of its recommendation system. To facilitate this, Netflix released real movie ratings from the users (voters) of the system. Any set of movies can be transformed into an election via a process outlined by Mattei, Forshee, and Goldsmith. This data set includes all 5-candidate elections with at least 350 voters generated by this process from 300 randomly chosen movies. Extending beyond prior work by Mattei et al., we allow for weak preferences, i.e., a voter is indifferent between a set of movies if he assigns the same rating to each of them. Thus, there are 541 possibilities to rank a given set of five movies. The archive is gzip compressed and includes 165,672 elections in PrefLib.org's TOC file format (Orders with Ties - Complete List).
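
The figure of 541 rankings is the number of weak orderings (rankings with ties allowed) of five items. The sketch below counts them with a standard recurrence and also shows how one voter's ratings induce a weak order. It is an illustration only, not the election-generation process of Mattei, Forshee, and Goldsmith, and the movie labels and ratings are made up.

```python
from math import comb


def weak_orderings(n):
    """Count rankings of n items when ties are allowed (ordered Bell number)."""
    counts = [1]  # counts[0]: one way to rank zero items
    for m in range(1, n + 1):
        # Choose the k items tied for first place, then rank the remaining m - k.
        counts.append(sum(comb(m, k) * counts[m - k] for k in range(1, m + 1)))
    return counts[n]


def ratings_to_weak_order(ratings):
    """Group movies by rating, best first; equal ratings form one indifference class."""
    levels = sorted(set(ratings.values()), reverse=True)
    return [sorted(m for m, r in ratings.items() if r == level) for level in levels]


print(weak_orderings(5))  # 541
# Hypothetical voter: movies A and C tie for first, B and E tie for second.
print(ratings_to_weak_order({"A": 5, "B": 3, "C": 5, "D": 1, "E": 3}))
```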

  13. Netflix Prize Data

    • kaggle.com
    zip
    Updated Nov 3, 2021
    Cite
    Elemento (2021). Netflix Prize Data [Dataset]. https://www.kaggle.com/elemento/netflix-prize-data
    Available download formats: zip (3152166694 bytes)
    Dataset updated
    Nov 3, 2021
    Authors
    Elemento
    Description

    Dataset

    This dataset was created by Elemento

    Contents

  14. Data from: A NOVEL LATENT FACTOR MODEL FOR RECOMMENDER SYSTEM

    • scielo.figshare.com
    jpeg
    Updated Jun 1, 2023
    Cite
    Bipul Kumar (2023). A NOVEL LATENT FACTOR MODEL FOR RECOMMENDER SYSTEM [Dataset]. http://doi.org/10.6084/m9.figshare.20011768.v1
    Available download formats: jpeg
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    SciELO journals
    Authors
    Bipul Kumar
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT: Matrix factorization (MF) has evolved as one of the better practices for handling sparse data in the field of recommender systems. Funk singular value decomposition (SVD) is a variant of MF that exists as a state-of-the-art method and that enabled winning the Netflix prize competition. The method is widely used, with modifications, in present-day research in the field of recommender systems. With the potential of data points to grow at very high velocity, it is prudent to devise newer methods that can handle such data more accurately and efficiently than Funk-SVD in the context of recommender systems. In view of the growing data points, I propose a latent factor model that caters to both accuracy and efficiency by reducing the number of latent features of either users or items, making it less complex than Funk-SVD, where the latent features of users and items are equal in number and often larger. A comprehensive empirical evaluation of accuracy on two publicly available datasets, amazon and ml-100k, reveals the comparable accuracy and lesser complexity of the proposed methods relative to Funk-SVD.
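
For readers unfamiliar with the baseline the abstract compares against, the sketch below shows a plain Funk-SVD-style factorization trained by stochastic gradient descent, with the same number of latent features for users and items. It is not the model proposed in the paper; the toy ratings, learning rate, regularization strength, and epoch count are all arbitrary choices for illustration.

```python
import random


def funk_svd(ratings, n_users, n_items, n_factors=2, lr=0.01, reg=0.02,
             epochs=200, seed=0):
    """Fit user and item factor vectors to (user, item, rating) triples by SGD."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def rmse(ratings, P, Q):
    """Root mean squared error of the factor model on the given triples."""
    total = sum((r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2
                for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5


# Hypothetical toy data: (user index, item index, rating).
data = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
P0, Q0 = funk_svd(data, 3, 3, epochs=0)   # untrained factors, same seed
P, Q = funk_svd(data, 3, 3)
print("RMSE before training:", rmse(data, P0, Q0))
print("RMSE after training: ", rmse(data, P, Q))
```

Only observed ratings enter the loss, which is what lets this family of methods cope with sparse matrices.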

  15. Netflix Movie Ratings

    • kaggle.com
    Updated Dec 9, 2024
    Cite
    Luis Heitor Ribeiro (2024). Netflix Movie Ratings [Dataset]. https://www.kaggle.com/datasets/luisheitorribeiro/netflix-movie-ratings/discussion?sort=undefined
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 9, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Luis Heitor Ribeiro
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This is a reduced dataset drawn from Netflix's much larger movie ratings database, for use in collaborative filtering, recommendation systems, and related applications.

    Any particular user has rated only a fraction of the movies, so the data matrix is only partially filled. The goal here is to fill all the remaining entries of the matrix, and then compare with the complete test matrix.
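
The fill-then-compare workflow described here can be sketched with a deliberately simple baseline. In the example below, missing cells are filled with each movie's mean observed rating and scored by RMSE against held-out values. The toy matrices are made up, and the column-mean filler is just a stand-in for whatever model is actually being evaluated.

```python
from math import sqrt

# Hypothetical toy data: None marks a rating the user has not given.
train = [
    [5, None, 3],
    [4, 2, None],
    [None, 1, 4],
]
# Held-out true values for the missing cells, keyed by (user, movie).
test = {(0, 1): 2, (1, 2): 3, (2, 0): 5}


def fill_with_movie_means(matrix):
    """Replace each missing cell with the mean of the observed ratings in its column.

    Assumes every column has at least one observed rating.
    """
    n_cols = len(matrix[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in matrix if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in matrix]


def rmse(filled, held_out):
    """Root mean squared error over the held-out cells only."""
    errors = [(filled[u][m] - r) ** 2 for (u, m), r in held_out.items()]
    return sqrt(sum(errors) / len(errors))


filled = fill_with_movie_means(train)
print(rmse(filled, test))  # 0.5
```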

  16. Amazon Prime TV Shows and Movies

    • kaggle.com
    Updated May 14, 2022
    Cite
    Victor Soeiro (2022). Amazon Prime TV Shows and Movies [Dataset]. https://www.kaggle.com/datasets/victorsoeiro/amazon-prime-tv-shows-and-movies/data
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 14, 2022
    Dataset provided by
    Kaggle
    Authors
    Victor Soeiro
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Amazon Prime - Movies and TV Dramas

    This data set was created to list all shows available on Amazon Prime streaming, and to analyze the data to find interesting facts. The data was acquired in May 2022 and covers titles available in the United States.

    Content

    This dataset has two files containing the titles (titles.csv) and the cast (credits.csv) for the title.

    This dataset contains over 9k unique titles on Amazon Prime, with 15 columns of information:

    • id: The title ID on JustWatch.
    • title: The name of the title.
    • show_type: TV show or movie.
    • description: A brief description.
    • release_year: The release year.
    • age_certification: The age certification.
    • runtime: The length of the episode (SHOW) or movie.
    • genres: A list of genres.
    • production_countries: A list of countries that produced the title.
    • seasons: Number of seasons if it's a SHOW.
    • imdb_id: The title ID on IMDB.
    • imdb_score: Score on IMDB.
    • imdb_votes: Votes on IMDB.
    • tmdb_popularity: Popularity on TMDB.
    • tmdb_score: Score on TMDB.

    And over 124k credits of actors and directors on Amazon Prime titles, with 5 columns of information:

    • person_ID: The person ID on JustWatch.
    • id: The title ID on JustWatch.
    • name: The actor or director's name.
    • character_name: The character name.
    • role: ACTOR or DIRECTOR.

    Tasks

    • Developing a content-based recommender system using the genres and/or descriptions.
    • Identifying the main content available on the streaming service.
    • Network analysis on the cast of the titles.
    • Exploratory data analysis to find interesting insights.
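
As one way to approach the first task in the list above, a content-based recommender can rank titles by how much their genre lists overlap. The sketch below uses Jaccard similarity on made-up rows. Only the `title` and `genres` column names come from the listing; everything else is a hypothetical illustration.

```python
# Hypothetical rows; only the `title` and `genres` columns listed above are used.
titles = [
    {"title": "Alpha", "genres": ["drama", "crime"]},
    {"title": "Beta", "genres": ["drama", "crime", "thriller"]},
    {"title": "Gamma", "genres": ["comedy"]},
]


def jaccard(a, b):
    """Size of the intersection over size of the union of two genre collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def most_similar(query_title, rows, top_n=2):
    """Rank the other titles by genre overlap with the query title."""
    query = next(r for r in rows if r["title"] == query_title)
    scored = [(jaccard(query["genres"], r["genres"]), r["title"])
              for r in rows if r["title"] != query_title]
    return sorted(scored, reverse=True)[:top_n]


print(most_similar("Alpha", titles))
```

Descriptions could be folded in the same way by comparing sets of words, although a weighted text representation would likely work better in practice.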

    Other Streaming Datasets

    How to obtain the data

    If you want to see how I obtained these data, please check my GitHub repository.

    Acknowledgements

    All data were collected from JustWatch.

  17. Disney+ TV Shows and Movies

    • kaggle.com
    Updated May 13, 2022
    Cite
    Victor Soeiro (2022). Disney+ TV Shows and Movies [Dataset]. https://www.kaggle.com/victorsoeiro/disney-tv-shows-and-movies/discussion
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 13, 2022
    Dataset provided by
    Kaggle
    Authors
    Victor Soeiro
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Disney+ - TV Shows and Movies

    This data set was created to list all shows available on Disney+ streaming, and to analyze the data to find interesting facts. The data was acquired in May 2022 and covers titles available in the United States.

    Content

    This dataset has two files containing the titles (titles.csv) and the cast (credits.csv) for the title.

    This dataset contains over 1500 unique titles on Disney+, with 15 columns of information:

    • id: The title ID on JustWatch.
    • title: The name of the title.
    • show_type: TV show or movie.
    • description: A brief description.
    • release_year: The release year.
    • age_certification: The age certification.
    • runtime: The length of the episode (SHOW) or movie.
    • genres: A list of genres.
    • production_countries: A list of countries that produced the title.
    • seasons: Number of seasons if it's a SHOW.
    • imdb_id: The title ID on IMDB.
    • imdb_score: Score on IMDB.
    • imdb_votes: Votes on IMDB.
    • tmdb_popularity: Popularity on TMDB.
    • tmdb_score: Score on TMDB.

    And over 26k credits of actors and directors on Disney+ titles, with 5 columns of information:

    • person_ID: The person ID on JustWatch.
    • id: The title ID on JustWatch.
    • name: The actor or director's name.
    • character_name: The character name.
    • role: ACTOR or DIRECTOR.

    Tasks

    • Developing a content-based recommender system using the genres and/or descriptions.
    • Identifying the main content available on the streaming service.
    • Network analysis on the cast of the titles.
    • Exploratory data analysis to find interesting insights.
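
For the network-analysis task in the list above, a common starting point is a co-star graph: two actors are linked if they appear in the same title. The sketch below builds weighted edges from made-up rows that follow the credits.csv columns. Keying on `name` keeps the output readable, but `person_ID` would be the safer key on real data, since names can collide.

```python
from collections import Counter
from itertools import combinations

# Hypothetical rows using the credits.csv columns listed above.
credits = [
    {"person_ID": 1, "id": "t1", "name": "Ann", "role": "ACTOR"},
    {"person_ID": 2, "id": "t1", "name": "Bob", "role": "ACTOR"},
    {"person_ID": 3, "id": "t1", "name": "Cy", "role": "DIRECTOR"},
    {"person_ID": 1, "id": "t2", "name": "Ann", "role": "ACTOR"},
    {"person_ID": 2, "id": "t2", "name": "Bob", "role": "ACTOR"},
]


def costar_edges(rows):
    """Count, for each pair of actors, how many titles they share."""
    cast_by_title = {}
    for row in rows:
        if row["role"] == "ACTOR":
            cast_by_title.setdefault(row["id"], set()).add(row["name"])
    edges = Counter()
    for cast in cast_by_title.values():
        # Sorting gives each pair a canonical order, so (A, B) and (B, A) merge.
        edges.update(combinations(sorted(cast), 2))
    return edges


print(costar_edges(credits))
```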

    Other Streaming Datasets

    How to obtain the data

    If you want to see how I obtained these data, please check my GitHub repository.

    Acknowledgements

    All data were collected from JustWatch.

  18. 350 000+ movies from themoviedb.org

    • kaggle.com
    zip
    Updated Oct 12, 2017
    Cite
    Stephanerappeneau (2017). 350 000+ movies from themoviedb.org [Dataset]. https://www.kaggle.com/stephanerappeneau/350-000-movies-from-themoviedborg
    Available download formats: zip (70483259 bytes)
    Dataset updated
    Oct 12, 2017
    Authors
    Stephanerappeneau
    Description

    Context

    I love movies.

    I tend to avoid marvel-transformers-standardized products, and prefer a mix of classic hollywood-golden-age and obscure polish artsy movies. Throw in an occasional japanese-zombie-slasher-giallo as an alibi. Good movies don't exist without bad movies.

    On average I watch 200+ movies each year, with peaks at more than 500 movies. Nine years ago I started to log my movies to avoid watching the same movie twice, and also assign scores. Over the years, it gave me a couple insights on my viewing habits but nothing more than what a tenth-grader would learn at school.

    I've recently subscribed to Netflix and it pains me to see the global inefficiency of recommendation systems for people like me, who mostly swear by "La politique des auteurs". It's a term coined by famous new-wave French movie critic André Bazin, meaning that the quality of a movie is essentially linked to the director and his capacity to execute his vision with his crew. We could debate that it depends on the movie production pipeline, but let's not for now. Practically, what it means is that I essentially watch movies from directors who made films I've liked.

    I suspect Netflix calibrates its recommendation models by taking into account the way the "average joe" chooses a movie. A few months ago I read a study based on a survey, showing that people chose a movie mostly based on genre (55%), then on leading actors (45%). Director and release date were far behind, at around 10% each. It is not surprising, since most people I know don't care who the director is. Lots of US blockbusters don't even mention it on the movie poster. I am aware that collaborative filtering is based on user proximity, which I believe decreases (or even eliminates) the need to characterize a movie. So here I'm more interested in content-based filtering, which is based on product proximity, for several reasons:

    • Users tastes are not easily accessible. It is, after all, Netflix treasure chest

    • Movie offer on Netflix is so bad for someone who likes author's movies that it wouldn't help

    • Modeling a movie intrinsic qualities is a nice challenge

    Enough.

    "*The secret of getting ahead is getting started*" (Mark Twain)

    [Image: network graph, https://img11.hostingpics.net/pics/117765networkgraph.png]

    Content

    The primary source is www.themoviedb.org. If you watch obscure artsy romanian homemade movies you may find only 95% of your movies referenced...but for anyone else it should be in the 98%+ range.

    Here is overview of the available sources that I've tried :

    • Imdb.com free csv dumps (ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/temporaryaccess/) are badly documented, incomplete, loosely structured and impossible to join/merge. There's an API hosted by Amazon Web Services: 1€ every 100 000 requests. With around 1 million movies, it could become expensive, and the features are bare. So I've searched for other sources.

    • www.themoviedb.org is based on crowdsourcing and has an excellent API, limited to 40 requests every 10 seconds. It is quite generous, well documented, and enough to sweep the 450 000 movies in a few days. For my purpose, data quality is not significantly worse than imdb, and as the imdb key is also included there's always the possibility to complete my dataset later (I actually did it).

    • www.Boxofficemojo.com has some interesting budget/revenue figures (which are sorely lacking in both imdb & tmdb), but it actually tracks only a few thousand movies, mainly blockbusters. There are other professional sources that are used by the film industry to get better predictive / marketing insights, but that's beyond my reach for this experiment.

    • www.wikipedia.com is an interesting source with no real cap on API calls; however, it requires a bit of web scraping, and for movies or directors the layout and quality vary a lot, so I suspected it would take a lot of work to get insights and I gave this source lower priority.

    • www.google.com will ban you after a few minutes of web scraping because their job is to scrape data from others, then sell it, duh.

    • It's worth mentioning that there are a few dumps of Netflix anonymized user tastes on Kaggle, because they've organised a few competitions to improve their recommendation models. https://www.kaggle.com/netflix-inc/netflix-prize-data

    • Online databases are largely white anglo-saxon centric, meaning the Bollywood offer (India is the 2nd biggest producer of movies) is mostly absent from datasets. I'm fine with that, as it's not my cup of tea, plus I lack domain knowledge. The sheer amount of Indian movies would probably skew my results anyway (I don't want to have too many martial-arts-musicals in my recommendations ;-)). I have, however, tremendous respect for the Indian movie industry, so I'd love to collaborate with an Indian cinephile!

    [Image: Westerns, https://img11.hostingpics.net/pics/340226westerns.png]
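
The themoviedb.org limit quoted in the source list (40 requests every 10 seconds) gives a lower bound on how long a full sweep takes, and it is straightforward to respect in client code. The sketch below is an assumption-laden illustration: it supposes one request per movie, and `SlidingWindowLimiter` is a hypothetical helper, not part of any TMDB client library.

```python
import time
from collections import deque


def min_sweep_hours(n_requests, max_requests=40, window_seconds=10):
    """Lower bound on wall-clock hours needed to issue n_requests under the limit."""
    seconds = n_requests / max_requests * window_seconds
    return seconds / 3600


class SlidingWindowLimiter:
    """Allow at most max_requests calls in any window_seconds interval."""

    def __init__(self, max_requests=40, window_seconds=10.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._stamps = deque()  # send times of the most recent requests

    def wait(self):
        """Block until another request may be sent, then record it."""
        now = self._clock()
        # Forget requests that have already left the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_requests:
            # Sleep until the oldest recorded request leaves the window.
            self._sleep(self.window_seconds - (now - self._stamps[0]))
            now = self._clock()
            self._stamps.popleft()
        self._stamps.append(now)


# One request per movie is an assumption; extra calls per movie scale the figure.
print(min_sweep_hours(450_000))  # 31.25
```

Calling `limiter.wait()` before each request keeps a client under the limit. At one request per movie the bound is a little over a day, so the "few days" quoted above is plausible once several calls per movie are needed.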

    Inspiration

    Starting from there, I had multiple problem statements for both supervised / unsupervised machine learning

    • Can I program a tailored-recommendation system based on my own criteria ?

    • What are the characteristics of movies/directors I like the most ?

    • What is the probability that I will like my next movie ?

    • Can I find the data ?

    One of the objectives of sharing my work here is to find cinephile data-scientists who might be interested and, hopefully, contribute or share insights :) Other interesting leads : use tagline for NLP/Clustering/Genre guessing, leverage on budget/revenue, link with other data sources using the imdb normalized title, etc.

    [Image: Correlation matrix, https://img11.hostingpics.net/pics/977004matrice.png]

    Motivation, Disclaimer and Acknowledgements

    • I graduated from a French engineering school, majoring in artificial intelligence, but that was 17 years ago, right in the middle of the A.I. winter. Like a lot of white male rocket scientists, I ended up in one of the leading European investment banks, quickly abandoning IT development to specialize in trading/risk project management and internal politics. My recent appointment in the Data Office made me aware of recent breakthroughs in data science, and I thought that developing a side project would be an excellent occasion to learn something new. Plus it would give me some much-needed credibility, which decision makers too often lack when it comes to data science.

    • I've worked on some of the features with Cédric Paternotte, a friend of mine who is a professor of philosophy of science at La Sorbonne. Working with someone with a different background seemed a good idea for motivation, creativity and rigor.

    • Kudos to www.themoviedb.org or www.wikipedia.com sites, who really have a great attitude towards open data. This is typically NOT the case of modern-bigdata companies who mostly keep data to themselves to try to monetize it. Such a huge contrast with imdb or instagram API, which generously let you grab your last 3 comments at a miserable rate. Even if 15 years ago this seemed a mandatory path to get services for free, I predict one day governments will need to break this data monopoly.

    [Disclaimer : I apologize in advance for my engrish (I'm french ^-^), any bad-code I've written (there are probably hundreds of way to do it better and faster), any pseudo-scientific assumption I've made, I'm slowly getting back in statistics and lack senior guidance, one day I regress a non-stationary time series and the day after I'll discover I shouldn't have, and any incorrect use of machine-learning models]

    [Image: powered by themoviedb.org, https://img11.hostingpics.net/pics/898068408x161poweredbyrectanglegreen.png]


Netflix Recommendation Engine Dataset (Ritik Kumar, 2024): 100 scholarly articles cite this dataset