17 datasets found
  1. IMDB movie details dataset

    • crawlfeeds.com
    csv, zip
    Updated Jul 5, 2025
    Cite
    Crawl Feeds (2025). IMDB movie details dataset [Dataset]. https://crawlfeeds.com/datasets/imdb-movie-details-dataset
    Explore at:
    zip, csv (available download formats)
    Dataset updated
    Jul 5, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description
    The IMDB Movie Details Dataset is a collection of data about movies, TV shows, and streaming content listed on IMDB. It includes titles, release years, genres, cast, crew, ratings, and more. The dataset suits data analysis, machine learning projects, predictive modeling, and the study of industry trends.
    Researchers can explore patterns in movie ratings and genre popularity, while developers can use the dataset to build recommendation systems or applications. Movie buffs can dive deep into historical and contemporary trends in the world of cinema. This dataset not only supports academic and professional pursuits but also opens doors for creative projects in storytelling, content creation, and audience engagement. Whether you’re a developer, researcher, or film enthusiast, the IMDB movie dataset is a powerful tool for uncovering trends and gaining deeper insights into the evolving entertainment landscape.
  2. IMDB Dataset For Machine Learning

    • kaggle.com
    Updated Sep 25, 2023
    Cite
    KHUSHI YADAV (2023). IMDB Dataset For Machine Learning [Dataset]. https://www.kaggle.com/datasets/khushiyadav2022/imdb-dataset-for-machine-learning
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 25, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    KHUSHI YADAV
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    "Movie Recommendation on the IMDB Dataset: A Journey into Machine Learning" is an exciting project focused on leveraging the IMDB Dataset for developing an advanced movie recommendation system. This project aims to explore the vast potential of machine learning techniques in providing personalized movie recommendations to users.

    The IMDB Dataset, comprising a wealth of movie information including genres, ratings, and user reviews, serves as the foundation for this project. By harnessing the power of machine learning algorithms and data analysis, the project seeks to build a recommendation system that can accurately suggest movies tailored to each individual's preferences.

  3. Data from: imdb

    • huggingface.co
    Updated Aug 3, 2003
    + more versions
    Cite
    Stanford NLP (2003). imdb [Dataset]. https://huggingface.co/datasets/stanfordnlp/imdb
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 3, 2003
    Dataset authored and provided by
    Stanford NLP
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for "imdb"

      Dataset Summary
    

    Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

      Supported Tasks and Leaderboards
    

    More Information Needed

      Languages
    

    More Information Needed

      Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
    
  4. IMDb Top Rated English Movies

    • kaggle.com
    Updated Nov 10, 2023
    Cite
    Alex Quilis (2023). IMDb Top Rated English Movies [Dataset]. https://www.kaggle.com/datasets/alexq1111/imdb-top-rated-english-movies/versions/1
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 10, 2023
    Dataset provided by
    Kaggle
    Authors
    Alex Quilis
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    I scraped data from IMDb to create a dataset of top-rated English movies. It includes movie names, release years, ratings, and user votes. The goal is to provide a valuable resource for movie enthusiasts and data analysts.

    Sources: The data comes directly from IMDb, a popular movie information platform. I used web scraping to extract details from IMDb pages, ensuring the dataset is accurate and comprehensive.

    Educational Intent: The entire data collection effort was driven by educational purposes, aiming to provide a curated dataset for analysis and exploration. Users are encouraged to leverage the dataset for educational and non-commercial purposes while being mindful of IMDb's terms of service.

    Inspiration for Skill Improvement: This project helped me improve my web scraping skills, especially in navigating HTML structures and handling data extraction. I also honed my data cleaning and preprocessing abilities to ensure the dataset's quality. Analyzing and visualizing the data further improved my data analysis skills. Overall, this practical project enhanced my proficiency in handling real-world datasets.

  5. autotrain-data-imdb-sentiment-analysis

    • huggingface.co
    Updated Aug 8, 2023
    + more versions
    Cite
    Feng Peng (2023). autotrain-data-imdb-sentiment-analysis [Dataset]. https://huggingface.co/datasets/linktimecloud/autotrain-data-imdb-sentiment-analysis
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 8, 2023
    Authors
    Feng Peng
    Description

    AutoTrain Dataset for project: imdb-sentiment-analysis

      Dataset Description
    

    This dataset has been automatically processed by AutoTrain for project imdb-sentiment-analysis.

      Languages
    

    The BCP-47 code for the dataset's language is en.

      Dataset Structure
    
    
    
    
    
    
    
      Data Instances
    

    A sample from this dataset looks as follows: [ { "text": "Me neither, but this flick is unfortunately one of those movies that are too bad to be good and… See the full description on the dataset page: https://huggingface.co/datasets/linktimecloud/autotrain-data-imdb-sentiment-analysis.

  6. autoeval-staging-eval-project-imdb-ed2a920e-12445656

    • huggingface.co
    Updated Aug 23, 2023
    + more versions
    Cite
    Evaluation on the Hub (2023). autoeval-staging-eval-project-imdb-ed2a920e-12445656 [Dataset]. https://huggingface.co/datasets/autoevaluate/autoeval-staging-eval-project-imdb-ed2a920e-12445656
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 23, 2023
    Dataset authored and provided by
    Evaluation on the Hub
    Description

    Dataset Card for AutoTrain Evaluator

    This repository contains model predictions generated by AutoTrain for the following task and dataset:

    • Task: Binary Text Classification
    • Model: lvwerra/distilbert-imdb
    • Dataset: imdb
    • Config: plain_text
    • Split: test

    To run new evaluation jobs, visit Hugging Face's automatic model evaluator.

      Contributions
    

    Thanks to @lvwerra for evaluating this model.

  7. IMDB Rating BeautifulSoup Project

    • kaggle.com
    Updated May 23, 2023
    Cite
    Pawan Kumar (2023). IMDB Rating BeautifulSoup Project [Dataset]. https://www.kaggle.com/datasets/pawankumar19/imdb-rating-beautifulsoup-project/code
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 23, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Pawan Kumar
    Description

    Dataset

    This dataset was created by Pawan Kumar

    Contents

  8. Amazon prime tv shows and movies dataset

    • crawlfeeds.com
    csv, zip
    Updated Jul 4, 2025
    Cite
    Crawl Feeds (2025). Amazon prime tv shows and movies dataset [Dataset]. https://crawlfeeds.com/datasets/amazon-prime-tv-shows-and-movies-dataset
    Explore at:
    zip, csv (available download formats)
    Dataset updated
    Jul 4, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Amazon Prime TV Shows and Movies Dataset offered by Crawl Feeds is an extensive resource containing over 92,000 records in JSON format. This dataset encompasses a wide array of data points, including links, titles, descriptions, release dates, genres, posters, streaming platforms, countries, number of seasons, content ratings, IMDb ratings, cast and crew details, unique identifiers, and scraping timestamps. Such comprehensive information is invaluable for researchers, data analysts, and developers aiming to conduct in-depth analyses, develop recommendation systems, or explore trends within Amazon Prime's content library.

    For those interested in broader media datasets, Crawl Feeds also offers the Movies and TV Shows Dataset, which includes 118,000 records, and the IMDb Movie Details Dataset, comprising 250,000 records. These datasets provide extensive information across various platforms, facilitating comparative studies and cross-platform analyses.

    Integrating these datasets into your projects can significantly enhance the depth and quality of your analyses, providing a robust foundation for exploring various facets of the entertainment industry. Whether you're developing a new application, conducting market research, or performing academic studies, these datasets serve as a valuable resource for gaining insights into the dynamic world of streaming media.

    Explore the Amazon Prime TV Shows and Movies Dataset and other related datasets on Crawl Feeds to elevate your data-driven projects.

  9. the_movies_dataset

    • kaggle.com
    zip
    Updated Jun 19, 2021
    + more versions
    Cite
    sezgin ildes (2021). the_movies_dataset [Dataset]. https://www.kaggle.com/sezginildes/the-movies-dataset
    Explore at:
    zip (15456686 bytes), available download format
    Dataset updated
    Jun 19, 2021
    Authors
    sezgin ildes
    Description

    Context

    These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.

    This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.

    Content

    This dataset consists of the following files:

    movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.

    keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.

    credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.

    links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.

    links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.

    ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.
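The description says keywords.csv and credits.csv hold stringified JSON objects, so each cell has to be decoded before use. A minimal sketch of that step follows, run on an in-memory stand-in rather than the real file. The column names `id` and `keywords`, the sample rows, and the fallback to Python-literal parsing are all assumptions for illustration, not details confirmed by the listing.

```python
import ast
import csv
import io
import json

# Hypothetical two-row stand-in for keywords.csv. The column names
# "id" and "keywords" are assumptions, not taken from the description.
SAMPLE_CSV = '''id,keywords
1,"[{""id"": 10, ""name"": ""jealousy""}, {""id"": 11, ""name"": ""toy""}]"
2,"[{'id': 12, 'name': 'board game'}]"
'''


def parse_stringified(cell):
    """Decode a cell that holds a stringified list of objects."""
    try:
        return json.loads(cell)
    except json.JSONDecodeError:
        # Assumed fallback: the cell is a Python-style literal
        # (single quotes) rather than strict JSON.
        return ast.literal_eval(cell)


def keyword_names(handle):
    """Map each movie id to the list of its keyword names."""
    return {
        row["id"]: [kw["name"] for kw in parse_stringified(row["keywords"])]
        for row in csv.DictReader(handle)
    }


names_by_movie = keyword_names(io.StringIO(SAMPLE_CSV))
```

To try this on a downloaded file, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` sample and adjust the column names to match the actual header.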

    The Full MovieLens Dataset, consisting of 26 million ratings and 750,000 tag applications from 270,000 users on all the 45,000 movies in this dataset, is available from the official GroupLens website.

    Acknowledgements

    This dataset is an ensemble of data collected from TMDB and GroupLens. The Movie Details, Credits and Keywords have been collected from the TMDB Open API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows.

    The Movie Links and Ratings have been obtained from the official GroupLens website. The files are part of the Full MovieLens Dataset available there.

    Inspiration

    This dataset was assembled as part of my second Capstone Project for Springboard's Data Science Career Track. I wanted to perform an extensive EDA on Movie Data to narrate the history and the story of Cinema and use this metadata in combination with MovieLens ratings to build various types of Recommender Systems.

    Both my notebooks are available as kernels with this dataset: The Story of Film and Movie Recommender Systems

    Some of the things you can do with this dataset:

    • Predicting movie revenue and/or movie success based on a certain metric.
    • Finding out which movies tend to get higher vote counts and vote averages on TMDB.
    • Building Content Based and Collaborative Filtering Based Recommendation Engines.

  10. autoeval-staging-eval-project-imdb-17316918-12425654

    • huggingface.co
    Updated Nov 2, 2023
    + more versions
    Cite
    Evaluation on the Hub (2023). autoeval-staging-eval-project-imdb-17316918-12425654 [Dataset]. https://huggingface.co/datasets/autoevaluate/autoeval-staging-eval-project-imdb-17316918-12425654
    Explore at:
    Dataset updated
    Nov 2, 2023
    Dataset authored and provided by
    Evaluation on the Hub
    Description

    autoevaluate/autoeval-staging-eval-project-imdb-17316918-12425654 dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. Amazon Prime TV Shows and Movies

    • kaggle.com
    Updated May 14, 2022
    Cite
    Victor Soeiro (2022). Amazon Prime TV Shows and Movies [Dataset]. https://www.kaggle.com/datasets/victorsoeiro/amazon-prime-tv-shows-and-movies/data
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 14, 2022
    Dataset provided by
    Kaggle
    Authors
    Victor Soeiro
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Amazon Prime - Movies and TV Dramas

    This data set was created to list all shows available on Amazon Prime streaming, and analyze the data to find interesting facts. This data was acquired in May 2022 containing data available in the United States.

    Content

    This dataset has two files containing the titles (titles.csv) and the cast (credits.csv) for the title.

    This dataset contains more than 9,000 unique titles on Amazon Prime with 15 columns containing their information:

    • id: The title ID on JustWatch.
    • title: The name of the title.
    • show_type: TV show or movie.
    • description: A brief description.
    • release_year: The release year.
    • age_certification: The age certification.
    • runtime: The length of the episode (SHOW) or movie.
    • genres: A list of genres.
    • production_countries: A list of countries that produced the title.
    • seasons: Number of seasons if it's a SHOW.
    • imdb_id: The title ID on IMDB.
    • imdb_score: Score on IMDB.
    • imdb_votes: Votes on IMDB.
    • tmdb_popularity: Popularity on TMDB.
    • tmdb_score: Score on TMDB.

    It also contains over 124,000 credits of actors and directors on Amazon Prime titles with 5 columns containing their information:

    • person_ID: The person ID on JustWatch.
    • id: The title ID on JustWatch.
    • name: The actor or director's name.
    • character_name: The character name.
    • role: ACTOR or DIRECTOR.
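Both files share the `id` column (the title ID on JustWatch), so credits can be attached to titles with a simple key lookup. The sketch below shows one way to do that with the standard library. The column names come from the lists above, while the sample rows and the function name `credits_by_title` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Tiny hypothetical stand-ins for titles.csv and credits.csv; only the
# column names are taken from the dataset description.
TITLES_CSV = '''id,title,show_type
t1,Example Movie,MOVIE
t2,Example Show,SHOW
'''
CREDITS_CSV = '''person_ID,id,name,character_name,role
p1,t1,Ada Actor,Hero,ACTOR
p2,t1,Dan Director,,DIRECTOR
p1,t2,Ada Actor,Guest,ACTOR
'''


def credits_by_title(titles_handle, credits_handle):
    """Group (name, role) pairs under the title they belong to."""
    titles = {row["id"]: row["title"] for row in csv.DictReader(titles_handle)}
    grouped = defaultdict(list)
    for row in csv.DictReader(credits_handle):
        # Skip credits whose title id is missing from titles.csv
        if row["id"] in titles:
            grouped[titles[row["id"]]].append((row["name"], row["role"]))
    return dict(grouped)


joined = credits_by_title(io.StringIO(TITLES_CSV), io.StringIO(CREDITS_CSV))
```

Grouping by title name keeps the example readable; with real data it may be safer to group by `id`, since two titles can share a name.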

    Tasks

    • Developing a content-based recommender system using the genres and/or descriptions.
    • Identifying the main content available on the streaming service.
    • Network analysis on the cast of the titles.
    • Exploratory data analysis to find interesting insights.
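As an illustration of the first task, a very small content-based recommender can rank titles by how much their genre lists overlap. The sketch below uses Jaccard similarity, which is one simple choice among many and is not a method named by the dataset author. The catalog contents and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two genre lists: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(target_id, genres_by_title, top_n=3):
    """Return the top_n titles whose genres overlap most with the target."""
    target = genres_by_title[target_id]
    scored = [
        (jaccard(target, genres), other_id)
        for other_id, genres in genres_by_title.items()
        if other_id != target_id
    ]
    # Highest score first; ties are broken by title id for determinism
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(other_id, score) for score, other_id in scored[:top_n]]


# Hypothetical catalog keyed by title id, mirroring the "genres" column
CATALOG = {
    "t1": ["drama", "crime"],
    "t2": ["drama", "crime", "thriller"],
    "t3": ["comedy"],
    "t4": ["crime"],
}
recommendations = most_similar("t1", CATALOG, top_n=2)
```

Genre overlap alone is coarse; the `description` column could be added through a text-similarity measure, which is left out here to keep the sketch short.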

    Other Streaming Datasets

    How to obtain the data

    If you want to see how I obtained these data, please check my GitHub repository.

    Acknowledgements

    All data were collected from JustWatch.

  12. Cinema Context in RDF

    • ssh.datastations.nl
    • datacatalogue.cessda.eu
    bin, css, jpeg, json +5
    Updated Nov 9, 2020
    Cite
    M. den Engelse; L. van Wissen; T. van Oort; J. Noordegraaf (2020). Cinema Context in RDF [Dataset]. http://doi.org/10.17026/DANS-Z64-MRVB
    Explore at:
    bin, css, jpeg, json, png, txt, text/markdown, text/plain, zip (available download formats)
    Dataset updated
    Nov 9, 2020
    Dataset provided by
    DANS Data Station Social Sciences and Humanities
    Authors
    M. den Engelse; L. van Wissen; T. van Oort; J. Noordegraaf
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Cinema Context is an online MySQL database containing places, persons and companies involved in more than 100,000 film screenings since 1895. CC provides insight into the ‘DNA’ of Dutch film and cinema culture and is praised by film historians worldwide. With a DANS Small Data Project grant, this data set has been converted to a Linked Data format (RDF). This data deposit contains both the RDF data set and the script used to convert the MySQL database into RDF.
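To give a feel for what converting relational rows into RDF involves, here is a minimal sketch that turns one row into N-Triples lines. It is not the conversion script from the deposit. The row fields (`cinema_id`, `name`, `address`), the `http://example.org/cinema/` base URI, and the choice of schema.org terms are all assumptions made for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def literal(value):
    """Format a plain string literal, escaping backslash, quote and newline."""
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"'


def row_to_ntriples(row, base="http://example.org/cinema/"):
    """Turn one (hypothetical) cinema row into a list of N-Triples lines."""
    subject = f"<{base}{row['cinema_id']}>"
    return [
        f"{subject} <{RDF_TYPE}> <http://schema.org/MovieTheater> .",
        f"{subject} <http://schema.org/name> {literal(row['name'])} .",
        f"{subject} <http://schema.org/address> {literal(row['address'])} .",
    ]


# Hypothetical row, as a database cursor might return it
sample_row = {"cinema_id": "42", "name": 'Cinema "Royal"', "address": "Main Street 1"}
triples = row_to_ntriples(sample_row)
```

A real conversion would more likely rely on an RDF library and the project's own vocabulary; plain string formatting is used here only to keep the example dependency-free.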

  13. Will They or Won't They Couples Data

    • figshare.com
    txt
    Updated May 16, 2024
    Cite
    Sarah Lotspeich; Ashley Mullan (2024). Will They or Won't They Couples Data [Dataset]. http://doi.org/10.6084/m9.figshare.24456844.v3
    Explore at:
    txt (available download format)
    Dataset updated
    May 16, 2024
    Dataset provided by
    figshare
    Authors
    Sarah Lotspeich; Ashley Mullan
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Many television shows follow the “will they or won’t they” trope, where the dynamic between a pair of main characters constantly shifts between friendship and something more throughout the run of the series. This trope has persisted throughout the decades, and examples include Sam and Diane from the 1980s show Cheers and Jess and Nick from the 2010s show New Girl. In some cases, the audience may wait multiple seasons before a couple like this gets together, and some suspect that producers delay the moment to create suspense and keep viewers engaged. Events marking major romantic milestones, such as the pair’s first kiss, often change the trajectory of the plot, influence the number of viewers tuning into the show, and drive up episode ratings. In this project, we scrape viewer ratings from the Internet Movie Database (IMDb) for 150 popular couples from 125 television series and then model the plot shifts following episodes with romantic milestones using causal inference methods. Specifically, we construct an interrupted time series model, where the interruption is the episode in which each couple has their first kiss. From this model, we assess whether these interruptions are associated with changes in viewer ratings on average.
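One common way to write an interrupted time series model for a single show is as a segmented regression: rating = b0 + b1·t + b2·post + b3·post·(t − t0), where t is the episode index, t0 is the first-kiss episode, and post is 1 from that episode on. Here b2 captures the immediate level change and b3 the change in slope. The sketch below fits that form by ordinary least squares in plain Python. It is an assumed textbook formulation for illustration, not the authors' actual model, and the function names and synthetic ratings are hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_its(ratings, kiss_index):
    """Least-squares fit of [intercept, trend, level change, slope change]."""
    rows = []
    for t in range(len(ratings)):
        post = 1.0 if t >= kiss_index else 0.0
        rows.append([1.0, float(t), post, post * (t - kiss_index)])
    k = 4
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ratings)) for i in range(k)]
    return solve_linear(xtx, xty)


# Synthetic, noise-free ratings: baseline 7.0 with trend +0.01 per episode,
# then a +0.5 jump and a slope change of -0.02 from episode 10 on.
synthetic = [
    7.0 + 0.01 * t + ((0.5 - 0.02 * (t - 10)) if t >= 10 else 0.0)
    for t in range(20)
]
coefficients = fit_its(synthetic, kiss_index=10)
```

With real ratings the fit would be noisy, and pooling many couples across series would call for a richer model than this single-series version.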

  14. Disney+ TV Shows and Movies

    • kaggle.com
    Updated May 13, 2022
    Cite
    Victor Soeiro (2022). Disney+ TV Shows and Movies [Dataset]. https://www.kaggle.com/victorsoeiro/disney-tv-shows-and-movies/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 13, 2022
    Dataset provided by
    Kaggle
    Authors
    Victor Soeiro
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Disney+ - TV Shows and Movies

    This data set was created to list all shows available on Disney+ streaming, and analyze the data to find interesting facts. This data was acquired in May 2022 containing data available in the United States.

    Content

    This dataset has two files containing the titles (titles.csv) and the cast (credits.csv) for the title.

    This dataset contains more than 1,500 unique titles on Disney+ with 15 columns containing their information:

    • id: The title ID on JustWatch.
    • title: The name of the title.
    • show_type: TV show or movie.
    • description: A brief description.
    • release_year: The release year.
    • age_certification: The age certification.
    • runtime: The length of the episode (SHOW) or movie.
    • genres: A list of genres.
    • production_countries: A list of countries that produced the title.
    • seasons: Number of seasons if it's a SHOW.
    • imdb_id: The title ID on IMDB.
    • imdb_score: Score on IMDB.
    • imdb_votes: Votes on IMDB.
    • tmdb_popularity: Popularity on TMDB.
    • tmdb_score: Score on TMDB.

    It also contains over 26,000 credits of actors and directors on Disney+ titles with 5 columns containing their information:

    • person_ID: The person ID on JustWatch.
    • id: The title ID on JustWatch.
    • name: The actor or director's name.
    • character_name: The character name.
    • role: ACTOR or DIRECTOR.

    Tasks

    • Developing a content-based recommender system using the genres and/or descriptions.
    • Identifying the main content available on the streaming service.
    • Network analysis on the cast of the titles.
    • Exploratory data analysis to find interesting insights.
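For the network-analysis task, a natural starting point is a co-star graph: two actors are linked when they appear in the same title, and the edge weight counts how many titles they share. The sketch below builds that weighted edge list from credit rows. The `id`, `name` and `role` fields follow the credits columns listed above, while the sample rows and the function name `costar_edges` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical credit rows using the id / name / role columns of credits.csv
CREDIT_ROWS = [
    {"id": "t1", "name": "Ada Actor", "role": "ACTOR"},
    {"id": "t1", "name": "Ben Player", "role": "ACTOR"},
    {"id": "t1", "name": "Dan Director", "role": "DIRECTOR"},
    {"id": "t2", "name": "Ada Actor", "role": "ACTOR"},
    {"id": "t2", "name": "Ben Player", "role": "ACTOR"},
    {"id": "t2", "name": "Cy Extra", "role": "ACTOR"},
]


def costar_edges(rows):
    """Count, for each pair of actors, the number of titles they share."""
    cast = defaultdict(set)
    for row in rows:
        if row["role"] == "ACTOR":
            cast[row["id"]].add(row["name"])
    edges = Counter()
    for names in cast.values():
        # Sorting makes each pair appear in one canonical order
        for pair in combinations(sorted(names), 2):
            edges[pair] += 1
    return edges


edges = costar_edges(CREDIT_ROWS)
```

The resulting counter can be handed to a graph library for centrality or community detection. Names are used as node keys here for readability, though `person_ID` would be the safer key on real data.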

    Other Streaming Datasets

    How to obtain the data

    If you want to see how I obtained these data, please check my GitHub repository.

    Acknowledgements

    All data were collected from JustWatch.

  15. 350 000+ movies from themoviedb.org

    • kaggle.com
    zip
    Updated Oct 12, 2017
    Cite
    Stephanerappeneau (2017). 350 000+ movies from themoviedb.org [Dataset]. https://www.kaggle.com/stephanerappeneau/350-000-movies-from-themoviedborg
    Explore at:
    zip (70483259 bytes), available download format
    Dataset updated
    Oct 12, 2017
    Authors
    Stephanerappeneau
    Description

    Context

    I love movies.

    I tend to avoid Marvel-Transformers-standardized products, and prefer a mix of classic Hollywood golden-age and obscure Polish artsy movies. Throw in an occasional Japanese zombie-slasher-giallo as an alibi. Good movies don't exist without bad movies.

    On average I watch 200+ movies each year, with peaks at more than 500 movies. Nine years ago I started to log my movies to avoid watching the same movie twice, and also to assign scores. Over the years, it gave me a couple of insights on my viewing habits but nothing more than what a tenth-grader would learn at school.

    I've recently subscribed to Netflix and it pains me to see the global inefficiency of recommendation systems for people like me, who mostly swear by "La politique des auteurs". It's a term coined by famous new-wave French movie critic André Bazin, meaning that the quality of a movie is essentially linked to the director and his capacity to execute his vision with his crew. We could debate whether it depends on the movie production pipeline, but let's not for now. Practically, what it means is that I essentially watch movies from directors who made films I've liked.

    I suspect Netflix calibrates its recommendation models by taking into account the way the "average Joe" chooses a movie. A few months ago I read a study based on a survey, showing that people chose a movie mostly based on genre (55%), then by leading actors (45%). Director or release date were far behind, at around 10% each. It is not surprising, since most people I know don't care who the director is. Lots of US blockbusters don't even mention it on the movie poster. I am aware that collaborative filtering is based on user proximity, which I believe decreases (or even eliminates) the need to characterize a movie. So here I'm more interested in content-based filtering, which is based on product proximity, for several reasons:

    • Users' tastes are not easily accessible. They are, after all, Netflix's treasure chest

    • The movie offering on Netflix is so bad for someone who likes auteur movies that it wouldn't help

    • Modeling a movie's intrinsic qualities is a nice challenge

    Enough.

    "The secret of getting ahead is getting started" (Mark Twain)

    [Figure: network graph]

    Content

    The primary source is www.themoviedb.org. If you watch obscure artsy Romanian homemade movies you may find only 95% of your movies referenced... but for anyone else it should be in the 98%+ range.

    Here is an overview of the available sources that I've tried:

    • Imdb.com free csv dumps (ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/temporaryaccess/) are badly documented, incomplete, loosely structured and impossible to join/merge. There's an API hosted by Amazon Web Services: 1€ every 100 000 requests. With around 1 million movies, it could become expensive, and its features are bare. So I searched for other sources.

    • www.themoviedb.org is based on crowdsourcing and has an excellent API, limited to 40 requests every 10 seconds. It is quite generous, well documented, and enough to sweep the 450 000 movies in a few days. For my purpose, data quality is not significantly worse than imdb's, and as the imdb key is also included there's always the possibility of completing my dataset later (I actually did it).

    • www.Boxofficemojo.com has some interesting budget/revenue figures (which are sorely lacking in both imdb & tmdb), but it actually tracks only a few thousand movies, mainly blockbusters. There are other professional sources that the film industry uses to get better predictive / marketing insights, but that's beyond my reach for this experiment.

    • www.wikipedia.com is an interesting source with no real cap on API calls. However, it requires a bit of web scraping, and for movies or directors the layout and quality vary a lot, so I suspected it would take a lot of work to get insights and gave this source lower priority.

    • www.google.com will ban you after a few minutes of web scraping because their job is to scrape data from others, then sell it, duh.

    • It's worth mentioning that there are a few dumps of Netflix anonymized user tastes on kaggle, because they've organised a few competitions to improve their recommendation models: https://www.kaggle.com/netflix-inc/netflix-prize-data

    • Online databases are largely white anglo-saxon centric, meaning the Bollywood offering (India is the 2nd biggest producer of movies) is mostly absent from datasets. I'm fine with that, as it's not my cup of tea, plus I lack domain knowledge. The sheer amount of Indian movies would probably skew my results anyway (I don't want to have too many martial-arts-musicals in my recommendations ;-)). I have, however, tremendous respect for the Indian movie industry so I'd love to collaborate with an Indian cinephile!

    [Figure: Westerns]
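The rate limit quoted above for www.themoviedb.org (40 requests every 10 seconds) allows a quick sanity check on the sweep time: at one request per movie and full speed, 450 000 movies would take roughly 31 hours, so "a few days" is plausible if each movie needs more than one call or the crawl runs below the cap. The sketch below does that arithmetic and shows a simple sliding-window limiter that stays under such a cap. It is an illustrative client-side pattern, not the author's crawler, and the class and function names are hypothetical.

```python
import time
from collections import deque


def sweep_hours(total_requests, max_requests=40, window_seconds=10):
    """Lower bound, in hours, on the time needed to send total_requests."""
    return total_requests / (max_requests / window_seconds) / 3600


class SlidingWindowLimiter:
    """Allow at most max_requests calls within any window_seconds span."""

    def __init__(self, max_requests=40, window_seconds=10.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock
        self.sleep = sleep
        self.sent = deque()  # timestamps of recent requests, oldest first

    def wait(self):
        """Block until one more request may be sent, then record it."""
        now = self.clock()
        # Forget requests that have already left the window
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) == self.max_requests:
            # Sleep until the oldest request leaves the window, then drop it
            self.sleep(self.window - (now - self.sent[0]))
            self.sent.popleft()
            now = self.clock()
        self.sent.append(now)


hours_for_full_sweep = sweep_hours(450_000)
```

In a crawler, `limiter.wait()` would be called immediately before each HTTP request. The injectable `clock` and `sleep` arguments exist only so the limiter can be exercised without real delays.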

    Inspiration

    Starting from there, I had multiple problem statements for both supervised / unsupervised machine learning

    • Can I program a tailored recommendation system based on my own criteria?

    • What are the characteristics of movies/directors I like the most?

    • What is the probability that I will like my next movie?

    • Can I find the data?

    One of the objectives of sharing my work here is to find cinephile data scientists who might be interested and, hopefully, contribute or share insights :) Other interesting leads: use the tagline for NLP/clustering/genre guessing, leverage budget/revenue, link with other data sources using the imdb normalized title, etc.

    [Figure: Correlation matrix]

    Motivation, Disclaimer and Acknowledgements

    • I graduated from a French engineering school, majoring in artificial intelligence, but that was 17 years ago, right in the middle of the A.I. winter. Like a lot of white male rocket scientists, I ended up in one of the leading European investment banks, quickly abandoning IT development to specialize in trading/risk project management and internal politics. My recent appointment in the Data Office made me aware of recent breakthroughs in data science, and I thought that developing a side project would be an excellent occasion to learn something new. Plus it would give me some much-needed credibility, which decision makers too often lack when it comes to data science.

    • I've worked on some of the features with Cédric Paternotte, a fellow friend of mine who is a professor of philosophy of sciences in La Sorbonne. Working with someone with a different background seem a good idea for motivation, creativity and rigor.

    • Kudos to www.themoviedb.org and www.wikipedia.com, which really have a great attitude towards open data. This is typically NOT the case for modern big-data companies, which mostly keep data to themselves to try to monetize it. Such a huge contrast with the IMDb or Instagram APIs, which generously let you grab your last 3 comments at a miserable rate. Even if 15 years ago this seemed a mandatory path to get services for free, I predict that one day governments will need to break this data monopoly.

    [Disclaimer: I apologize in advance for my Engrish (I'm French ^-^), any bad code I've written (there are probably hundreds of ways to do it better and faster), any pseudo-scientific assumptions I've made (I'm slowly getting back into statistics and lack senior guidance: one day I regress a non-stationary time series and the day after I discover I shouldn't have), and any incorrect use of machine-learning models.]

    [Image: powered by themoviedb.org, https://img11.hostingpics.net/pics/898068408x161poweredbyrectanglegreen.png]

  16. Ten Thousand German News Articles Dataset

    • kaggle.com
    • tblock.github.io
    zip
    Updated Jan 20, 2022
    Timo Block (2022). Ten Thousand German News Articles Dataset [Dataset]. https://www.kaggle.com/tblock/10kgnad
    Explore at:
    Available download formats: zip (21144764 bytes)
    Dataset updated
    Jan 20, 2022
    Authors
    Timo Block
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    (see https://tblock.github.io/10kGNAD/ for the original dataset page)

    This page introduces the 10k German News Articles Dataset (10kGNAD), a German topic classification dataset. The 10kGNAD is based on the One Million Posts Corpus and is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. You can download the dataset here.

    Why a German dataset?

    English text classification datasets are common. Examples are the big AG News, the class-rich 20 Newsgroups and the large-scale DBpedia ontology datasets for topic classification and, for sentiment analysis, the commonly used IMDb and Yelp datasets. Non-English datasets, especially German datasets, are less common. There is a collection of sentiment analysis datasets assembled by the Interest Group on German Sentiment Analysis. However, to my knowledge, no German topic classification dataset is available to the public.

    Due to grammatical differences between English and German, a classifier might be effective on an English dataset but not as effective on a German dataset. German is more highly inflected than English, and long compound words are quite common. One would need to evaluate a classifier on multiple German datasets to get a sense of its effectiveness.

    The dataset

    The 10kGNAD dataset is intended to solve part of this problem as the first German topic classification dataset. It consists of 10273 German-language news articles from an Austrian online newspaper, categorized into nine topics. These articles are a previously unused part of the One Million Posts Corpus.

    In the One Million Posts Corpus each article has a topic path, for example Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise. The 10kGNAD uses the second part of the topic path, here Wirtschaft, as the class label. As a result, the dataset can be used for multi-class classification.
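
    The labelling rule described above can be illustrated with a minimal sketch. This is not the author's extraction script; the function name topic_label is hypothetical, and the only assumption is that topic paths are slash-separated strings like the example given.

```python
def topic_label(topic_path: str) -> str:
    """Return the second component of a slash-separated topic path.

    Hypothetical helper for illustration; not part of the 10kGNAD code.
    """
    # Drop leading/trailing slashes so "/Newsroom/Web/" behaves like "Newsroom/Web".
    parts = topic_path.strip("/").split("/")
    if len(parts) < 2:
        raise ValueError(f"topic path has no second component: {topic_path!r}")
    return parts[1]


# The example path quoted in the dataset description.
example_path = "Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise"
print(topic_label(example_path))
```

    Raising an error for paths with fewer than two components is a design choice of this sketch; the original scripts may handle such articles differently.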

    I created and used this dataset in my thesis to train and evaluate four text classifiers on the German language. By publishing the dataset I hope to support the advancement of tools and models for the German language. Additionally, this dataset can be used as a benchmark dataset for German topic classification.

    Numbers and statistics

    As in most real-world datasets, the class distribution of the 10kGNAD is not balanced. The biggest class, Web, consists of 1678 articles, while the smallest class, Kultur, contains only 539 articles. However, articles from the Web class have on average the fewest words, while articles from the Kultur class have the second most words.

    Splitting into train and test

    I propose a stratified split of 10% for testing and the remaining articles for training. To use the dataset as a benchmark dataset, please use the train.csv and test.csv files located in the project root.
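
    For readers unfamiliar with stratified splitting, the sketch below shows the general idea using only the Python standard library. It is not the script shipped with the dataset: the function name, the seed, and the toy rows are assumptions made for illustration, and benchmark comparisons should still use the published train.csv and test.csv.

```python
import random
from collections import defaultdict


def stratified_split(rows, test_fraction=0.1, seed=42):
    """Split (label, text) rows so each label keeps roughly the same share in the test set.

    Hypothetical helper; the seed and rounding rule are illustrative choices.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[0]].append(row)

    rng = random.Random(seed)
    train, test = [], []
    # Iterate labels in sorted order so the result is reproducible for a given seed.
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy, made-up rows with an unbalanced class distribution.
rows = [("Web", f"web article {i}") for i in range(30)]
rows += [("Kultur", f"kultur article {i}") for i in range(10)]

train, test = stratified_split(rows)
print(len(train), len(test))
```

    Splitting each class separately keeps small classes such as Kultur represented in the test set, which a plain random 10% sample does not guarantee.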

    Code

    Python scripts to extract the articles and split them into a train set and a test set are available in the code directory of this project. Make sure to install the requirements. The original corpus.sqlite3 is required to extract the articles (download here (compressed) or here (uncompressed)).

    License

    Creative Commons License

    This dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please consider citing the authors of the One Million Posts Corpus if you use the dataset.

  17. HebrewMetaphors

    • huggingface.co
    Updated Oct 3, 2023
    Technion Data and Knowledge Lab (2023). HebrewMetaphors [Dataset]. https://huggingface.co/datasets/tdklab/HebrewMetaphors
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 3, 2023
    Dataset authored and provided by
    Technion Data and Knowledge Lab
    Description

    Dataset Card for "HebrewMetaphors"

      Dataset Summary
    

    A common dataset for text classification tasks is the IMDb Large Movie Review Dataset, a dataset for binary sentiment classification. The first step in our project was to create a Hebrew dataset with an IMDb-like structure that differs in one respect: in addition to the sentences, it also contains verb names and a classification of whether each verb name is literal or metaphorical in the given sentence. Using an… See the full description on the dataset page: https://huggingface.co/datasets/tdklab/HebrewMetaphors.


Crawl Feeds (2025). IMDB movie details dataset [Dataset]. https://crawlfeeds.com/datasets/imdb-movie-details-dataset

IMDB movie details dataset

IMDB movie details dataset from imdb.com

Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: zip, csv
Dataset updated
Jul 5, 2025
Dataset authored and provided by
Crawl Feeds
License

https://crawlfeeds.com/privacy_policy
