https://crawlfeeds.com/privacy_policy
https://creativecommons.org/publicdomain/zero/1.0/
"Movie Recommendation on the IMDB Dataset: A Journey into Machine Learning" is an exciting project focused on leveraging the IMDB Dataset for developing an advanced movie recommendation system. This project aims to explore the vast potential of machine learning techniques in providing personalized movie recommendations to users.
The IMDB Dataset, comprising a wealth of movie information including genres, ratings, and user reviews, serves as the foundation for this project. By harnessing the power of machine learning algorithms and data analysis, the project seeks to build a recommendation system that can accurately suggest movies tailored to each individual's preferences.
https://choosealicense.com/licenses/other/
Dataset Card for "imdb"
Dataset Summary
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
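The binary sentiment task this split supports can be illustrated with a deliberately tiny sketch — a multinomial Naive Bayes classifier over made-up reviews. The data and model below are purely illustrative; the benchmark itself is tackled with far stronger models:

```python
import math
from collections import Counter

# Toy stand-in for the 25k/25k review split: (review, label) pairs,
# 1 = positive, 0 = negative.
train = [
    ("a great and moving film", 1),
    ("wonderful acting and a great script", 1),
    ("terrible plot and awful acting", 0),
    ("an awful boring mess", 0),
]

def fit(data):
    """Count word frequencies per class (multinomial Naive Bayes)."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in data:
        counts[label].update(text.split())
    return counts

def predict(counts, text):
    """Pick the class with the higher add-one-smoothed log-likelihood."""
    vocab = set(counts[0]) | set(counts[1])
    scores = {}
    for label, c in counts.items():
        total = sum(c.values()) + len(vocab)
        scores[label] = sum(math.log((c[w] + 1) / total) for w in text.split())
    return max(scores, key=scores.get)

model = fit(train)
print(predict(model, "great film"))  # 1
print(predict(model, "awful plot"))  # 0
```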
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
I scraped data from IMDb to create a dataset of top-rated English movies. It includes movie names, release years, ratings, and user votes. The goal is to provide a valuable resource for movie enthusiasts and data analysts.
Sources: The data comes directly from IMDb, a popular movie information platform. I used web scraping to extract details from IMDb pages, ensuring the dataset is accurate and comprehensive.
Educational Intent: The entire data collection effort was driven by educational purposes, aiming to provide a curated dataset for analysis and exploration. Users are encouraged to leverage the dataset for educational and non-commercial purposes while being mindful of IMDb's terms of service.
Inspiration for Skill Improvement: This project helped me improve my web scraping skills, especially in navigating HTML structures and handling data extraction. I also honed my data cleaning and preprocessing abilities to ensure the dataset's quality. Analyzing and visualizing the data further improved my data analysis skills. Overall, this practical project enhanced my proficiency in handling real-world datasets.
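As a hedged illustration of the extraction step, the sketch below parses an IMDb-like HTML fragment with Python's standard-library HTMLParser. The markup and class names are hypothetical — real IMDb pages are structured differently, and scraping them is subject to IMDb's terms of service:

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration only; real IMDb pages differ.
PAGE = """
<li><span class="title">The Godfather (1972)</span>
    <span class="rating">9.2</span></li>
<li><span class="title">12 Angry Men (1957)</span>
    <span class="rating">9.0</span></li>
"""

class MovieParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.field = None   # class attribute of the span we are inside
        self.rows = []      # collected [title, rating] pairs

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.field = dict(attrs).get("class")

    def handle_data(self, data):
        if self.field == "title":
            self.rows.append([data.strip(), None])
        elif self.field == "rating" and self.rows:
            self.rows[-1][1] = float(data.strip())
        self.field = None

parser = MovieParser()
parser.feed(PAGE)
print(parser.rows)  # [['The Godfather (1972)', 9.2], ['12 Angry Men (1957)', 9.0]]
```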
AutoTrain Dataset for project: imdb-sentiment-analysis
Dataset Description
This dataset has been automatically processed by AutoTrain for project imdb-sentiment-analysis.
Languages
The BCP-47 code for the dataset's language is en.
Dataset Structure
Data Instances
A sample from this dataset looks as follows: [ { "text": "Me neither, but this flick is unfortunately one of those movies that are too bad to be good and… See the full description on the dataset page: https://huggingface.co/datasets/linktimecloud/autotrain-data-imdb-sentiment-analysis.
Dataset Card for AutoTrain Evaluator
This repository contains model predictions generated by AutoTrain for the following task and dataset:
Task: Binary Text Classification
Model: lvwerra/distilbert-imdb
Dataset: imdb
Config: plain_text
Split: test
To run new evaluation jobs, visit Hugging Face's automatic model evaluator.
Contributions
Thanks to @lvwerra for evaluating this model.
This dataset was created by Pawan Kumar
https://crawlfeeds.com/privacy_policy
Amazon Prime TV Shows and Movies Dataset offered by Crawl Feeds is an extensive resource containing over 92,000 records in JSON format. This dataset encompasses a wide array of data points, including links, titles, descriptions, release dates, genres, posters, streaming platforms, countries, number of seasons, content ratings, IMDb ratings, cast and crew details, unique identifiers, and scraping timestamps. Such comprehensive information is invaluable for researchers, data analysts, and developers aiming to conduct in-depth analyses, develop recommendation systems, or explore trends within Amazon Prime's content library.
For those interested in broader media datasets, Crawl Feeds also offers the Movies and TV Shows Dataset, which includes 118,000 records, and the IMDb Movie Details Dataset, comprising 250,000 records. These datasets provide extensive information across various platforms, facilitating comparative studies and cross-platform analyses.
Integrating these datasets into your projects can significantly enhance the depth and quality of your analyses, providing a robust foundation for exploring various facets of the entertainment industry. Whether you're developing a new application, conducting market research, or performing academic studies, these datasets serve as a valuable resource for gaining insights into the dynamic world of streaming media.
Explore the Amazon Prime TV Shows and Movies Dataset and other related datasets on Crawl Feeds to elevate your data-driven projects.
Context These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.
This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.
Content This dataset consists of the following files:
movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.
keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.
credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.
links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.
links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.
ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.
The Full MovieLens Dataset consisting of 26 million ratings and 750,000 tag applications from 270,000 users on all the 45,000 movies in this dataset can be accessed here
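A practical note on the stringified JSON columns mentioned above: the cells use Python literal syntax (single quotes), so `ast.literal_eval` is a safer parser than `json.loads`. A minimal sketch, with an illustrative cell value in the style of keywords.csv:

```python
import ast

# One cell as shipped: a Python-literal string, not strict JSON
# (single quotes), so ast.literal_eval is the safe parser.
cell = "[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}]"

keywords = [d["name"] for d in ast.literal_eval(cell)]
print(keywords)  # ['jealousy', 'toy']
```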
Acknowledgements This dataset is an ensemble of data collected from TMDB and GroupLens. The Movie Details, Credits and Keywords have been collected from the TMDB Open API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows. You can try it for yourself here.
The Movie Links and Ratings have been obtained from the Official GroupLens website. The files are a part of the dataset available here
Inspiration This dataset was assembled as part of my second Capstone Project for Springboard's Data Science Career Track. I wanted to perform an extensive EDA on Movie Data to narrate the history and the story of Cinema and use this metadata in combination with MovieLens ratings to build various types of Recommender Systems.
Both my notebooks are available as kernels with this dataset: The Story of Film and Movie Recommender Systems
Some of the things you can do with this dataset: Predicting movie revenue and/or movie success based on a certain metric. What movies tend to get higher vote counts and vote averages on TMDB? Building Content Based and Collaborative Filtering Based Recommendation Engines.
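For the recommendation-engine idea, a minimal content-based sketch: rank movies by Jaccard similarity of their genre sets. The titles and genres below are illustrative; a real system would draw on the richer metadata in movies_metadata.csv:

```python
# Illustrative catalog: title -> set of genres.
movies = {
    "Toy Story": {"Animation", "Comedy", "Family"},
    "Jumanji": {"Adventure", "Fantasy", "Family"},
    "Heat": {"Action", "Crime", "Thriller"},
}

def jaccard(a, b):
    """Overlap of two sets relative to their union."""
    return len(a & b) / len(a | b)

def recommend(liked, k=2):
    """Rank all other titles by genre similarity to the liked title."""
    scores = {t: jaccard(movies[liked], g)
              for t, g in movies.items() if t != liked}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("Toy Story"))  # ['Jumanji', 'Heat']
```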
The autoevaluate/autoeval-staging-eval-project-imdb-17316918-12425654 dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created to list all shows available on Amazon Prime streaming and to analyze the data for interesting facts. The data was acquired in May 2022 and covers titles available in the United States.
This dataset has two files containing the titles (titles.csv) and the cast (credits.csv) for the title.
This dataset contains more than 9,000 unique titles on Amazon Prime, with 15 columns of information, including:
- id: The title ID on JustWatch.
- title: The name of the title.
- show_type: TV show or movie.
- description: A brief description.
- release_year: The release year.
- age_certification: The age certification.
- runtime: The length of the episode (SHOW) or movie.
- genres: A list of genres.
- production_countries: A list of countries that produced the title.
- seasons: Number of seasons if it's a SHOW.
- imdb_id: The title ID on IMDB.
- imdb_score: Score on IMDB.
- imdb_votes: Votes on IMDB.
- tmdb_popularity: Popularity on TMDB.
- tmdb_score: Score on TMDB.
It also includes over 124,000 credits of actors and directors on Amazon Prime titles, with 5 columns of information:
- person_ID: The person ID on JustWatch.
- id: The title ID on JustWatch.
- name: The actor or director's name.
- character_name: The character name.
- role: ACTOR or DIRECTOR.
Some ideas for working with this dataset:
- Developing a content-based recommender system using the genres and/or descriptions.
- Identifying the main content available on the streaming service.
- Network analysis on the cast of the titles.
- Exploratory data analysis to find interesting insights.
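Since titles.csv and credits.csv share the JustWatch id column, a minimal join sketch looks like this. The rows below are made up and only mimic the described columns:

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-ins for titles.csv and credits.csv; the column
# names follow the description above, the rows are invented.
titles_csv = "id,title,show_type\nts1,Some Movie,MOVIE\nts2,Some Show,SHOW\n"
credits_csv = ("person_ID,id,name,character_name,role\n"
               "p1,ts1,Jane Doe,Lead,ACTOR\n"
               "p2,ts1,John Roe,,DIRECTOR\n")

# Index the credits by title id, then attach them to each title.
cast = defaultdict(list)
for row in csv.DictReader(io.StringIO(credits_csv)):
    cast[row["id"]].append((row["name"], row["role"]))

for row in csv.DictReader(io.StringIO(titles_csv)):
    print(row["title"], cast.get(row["id"], []))
```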
If you want to see how I obtained these data, please check my GitHub repository.
All data were collected from JustWatch.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cinema Context is an online MySQL database containing places, persons and companies involved in more than 100,000 film screenings since 1895. CC provides insight into the ‘DNA’ of Dutch film and cinema culture and is praised by film historians worldwide. With a DANS Small Data Project grant, this data set has been converted to a Linked Data format (RDF). This data deposit contains both the RDF data set and the script used to convert the MySQL database into RDF.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many television shows follow the “will they or won’t they” trope, where the dynamic between a pair of main characters constantly shifts between friendship and something more throughout the run of the series. This trope has persisted throughout the decades, and examples include Sam and Diane from the 1980s show Cheers and Jess and Nick from the 2010s show New Girl. In some cases, the audience may wait multiple seasons before a couple like this gets together, and some suspect that producers delay the moment to create suspense and keep viewers engaged. Events marking major romantic milestones, such as the pair’s first kiss, often change the trajectory of the plot, influence the number of viewers tuning into the show, and drive up episode ratings. In this project, we scrape viewer ratings from the Internet Movie Database (IMDb) for 150 popular couples from 125 television series and then model the plot shifts following episodes with romantic milestones using causal inference methods. Specifically, we construct an interrupted time series model, where the interruption is the episode in which each couple has their first kiss. From this model, we assess whether these interruptions are associated with changes in viewer ratings on average.
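The interrupted time series idea can be sketched as a segmented regression with a level-shift dummy at the milestone episode. The series below is synthetic and noise-free, purely to show the model form, not the project's actual estimates:

```python
import numpy as np

# Interrupted time series as segmented regression: episode rating as a
# linear trend plus a level shift at the first-kiss episode k.
t = np.arange(20)
k = 10
ratings = 7.0 + 0.02 * t + 0.5 * (t >= k)   # synthetic, noise-free series

# Design matrix: intercept, trend, post-interruption indicator.
X = np.column_stack([np.ones_like(t), t, (t >= k).astype(float)])
beta, *_ = np.linalg.lstsq(X, ratings, rcond=None)
print(beta)  # approximately [7.0, 0.02, 0.5]: baseline, trend, jump
```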
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created to list all shows available on Disney+ streaming and to analyze the data for interesting facts. The data was acquired in May 2022 and covers titles available in the United States.
This dataset has two files containing the titles (titles.csv) and the cast (credits.csv) for the title.
This dataset contains more than 1,500 unique titles on Disney+, with 15 columns of information, including:
- id: The title ID on JustWatch.
- title: The name of the title.
- show_type: TV show or movie.
- description: A brief description.
- release_year: The release year.
- age_certification: The age certification.
- runtime: The length of the episode (SHOW) or movie.
- genres: A list of genres.
- production_countries: A list of countries that produced the title.
- seasons: Number of seasons if it's a SHOW.
- imdb_id: The title ID on IMDB.
- imdb_score: Score on IMDB.
- imdb_votes: Votes on IMDB.
- tmdb_popularity: Popularity on TMDB.
- tmdb_score: Score on TMDB.
It also includes over 26,000 credits of actors and directors on Disney+ titles, with 5 columns of information:
- person_ID: The person ID on JustWatch.
- id: The title ID on JustWatch.
- name: The actor or director's name.
- character_name: The character name.
- role: ACTOR or DIRECTOR.
Some ideas for working with this dataset:
- Developing a content-based recommender system using the genres and/or descriptions.
- Identifying the main content available on the streaming service.
- Network analysis on the cast of the titles.
- Exploratory data analysis to find interesting insights.
If you want to see how I obtained these data, please check my GitHub repository.
All data were collected from JustWatch.
I love movies.
I tend to avoid standardized Marvel/Transformers products, and prefer a mix of classic Hollywood golden age and obscure Polish artsy movies. Throw in an occasional Japanese zombie-slasher giallo as an alibi. Good movies don't exist without bad movies.
On average I watch 200+ movies each year, with peaks at more than 500. Nine years ago I started logging my movies to avoid watching the same one twice, and also to assign scores. Over the years, this gave me a couple of insights into my viewing habits, but nothing more than what a tenth-grader would learn at school.
I've recently subscribed to Netflix, and it pains me to see how inefficient recommendation systems are for people like me, who mostly swear by "la politique des auteurs". The term was coined by the famous French New Wave movie critic André Bazin, and means that the quality of a movie is essentially linked to the director and his capacity to execute his vision with his crew. We could debate whether that depends on the movie production pipeline, but let's not for now. Practically, what it means is that I essentially watch movies from directors who made films I've liked.
I suspect Netflix calibrates its recommendation models on the way the "average Joe" chooses a movie. A few months ago I read a survey-based study showing that people choose a movie mostly by genre (55%), then by leading actors (45%); director and release date were far behind, at around 10% each. That's not surprising, since most people I know don't care who the director is. Lots of US blockbusters don't even mention it on the movie poster. I am aware that collaborative filtering is based on user proximity, which I believe decreases (or even eliminates) the need to characterize a movie. So here I'm more interested in content-based filtering, which is based on product proximity, for several reasons:
- Users' tastes are not easily accessible. They are, after all, Netflix's treasure chest.
- The movie selection on Netflix is so poor for someone who likes auteur films that it wouldn't help.
- Modeling a movie's intrinsic qualities is a nice challenge.
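A minimal sketch of the director-first filter described above — surface unseen films by directors whose previous films scored highly. All titles and scores below are illustrative:

```python
from collections import defaultdict

# "La politique des auteurs" as a filter: recommend unseen films by
# directors whose films I rated highly. Illustrative data only.
catalog = [
    ("Playtime", "Jacques Tati"),
    ("Trafic", "Jacques Tati"),
    ("Stalker", "Andrei Tarkovsky"),
    ("Generic Blockbuster 7", "Committee"),
]
my_scores = {"Playtime": 9, "Stalker": 8, "Generic Blockbuster 7": 3}

# Average my scores per director over the films already seen.
seen = defaultdict(list)
for film, director in catalog:
    if film in my_scores:
        seen[director].append(my_scores[film])
liked = {d for d, s in seen.items() if sum(s) / len(s) >= 7}

# Recommend unseen films from liked directors.
picks = [f for f, d in catalog if d in liked and f not in my_scores]
print(picks)  # ['Trafic']
```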
Enough.
"*The secret of getting ahead is getting started*" (Mark Twain)
[Image: network graph — https://img11.hostingpics.net/pics/117765networkgraph.png]
The primary source is www.themoviedb.org. If you watch obscure artsy Romanian homemade movies you may find only 95% of your movies referenced... but for anyone else it should be in the 98%+ range.
Movie details come from the www.themoviedb.org API (movies/details).
Movie crew & casting come from the www.themoviedb.org API (movies/credits).
Both can be joined by id.
They cover all ~350k movies, from the end of the 19th century up to August 2017. If you remove short films from IMDb, you get a similar number of movies.
I uploaded the program to retrieve incremental movie details to GitHub: https://github.com/stephanerappeneau/scienceofmovies/tree/master/PycharmProjects/GetAllMovies (you'll need a dev API key from themoviedb.org, though).
I have tried various supervised (decision tree) and unsupervised (clustering, NLP) approaches described in the discussions; source code is on GitHub: https://github.com/stephanerappeneau/scienceofmovies
As a bonus, I've uploaded the bio summaries of the top 500 critically acclaimed directors from Wikipedia, for some interesting NLTK analysis.
Here is an overview of the sources I've tried:
• Imdb.com free CSV dumps (ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/temporaryaccess/) are badly documented, incomplete, loosely structured and impossible to join/merge. There's an API hosted on Amazon Web Services at 1€ per 100,000 requests; with around 1 million movies it could become expensive, and the features are bare. So I searched for other sources.
• www.themoviedb.org is based on crowdsourcing and has an excellent API, limited to 40 requests every 10 seconds. That's quite generous, well documented, and enough to sweep the 450,000 movies in a few days. For my purpose, data quality is not significantly worse than IMDb's, and as the IMDb key is also included, there's always the possibility of completing my dataset later (I actually did).
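A hedged sketch of staying under that 40-requests-per-10-seconds limit with a sliding-window throttle; the fetch function here is a stub standing in for real themoviedb.org API calls:

```python
import time
from collections import deque

def rate_limited(fetch, ids, max_calls=40, per_seconds=10.0):
    """Call fetch(id) for each id, staying under max_calls per window."""
    stamps = deque()   # monotonic timestamps of recent calls
    results = []
    for movie_id in ids:
        now = time.monotonic()
        # Drop timestamps that have fallen out of the window.
        while stamps and now - stamps[0] > per_seconds:
            stamps.popleft()
        # If the window is full, wait until the oldest call expires.
        if len(stamps) >= max_calls:
            time.sleep(per_seconds - (now - stamps[0]))
        stamps.append(time.monotonic())
        results.append(fetch(movie_id))
    return results

# A stub fetch stands in for a real /movie/{id} request here.
print(rate_limited(lambda i: {"id": i}, [550, 551], max_calls=2))
```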
• www.Boxofficemojo.com has some interesting budget/revenue figures (which are sorely lacking in both IMDb and TMDB), but it actually tracks only a few thousand movies, mainly blockbusters. There are other professional sources used by the film industry to get better predictive/marketing insights, but those are beyond my reach for this experiment.
• www.wikipedia.com is an interesting source with no real cap on API calls, but it requires a bit of web scraping, and for movies and directors the layout and quality vary a lot. I suspected it would take a lot of work to get insights, so I put this source at lower priority.
• www.google.com will ban you after a few minutes of web scraping, because their job is to scrape data from others and then sell it.
• It's worth mentioning that there are a few dumps of anonymized Netflix user ratings on Kaggle, because Netflix organised a few competitions to improve its recommendation models: https://www.kaggle.com/netflix-inc/netflix-prize-data
• Online databases are largely white-Anglo-Saxon-centric, meaning Bollywood (India is the second-biggest producer of movies) is mostly absent from datasets. I'm fine with that, as it's not my cup of tea, plus I lack domain knowledge. The sheer number of Indian movies would probably skew my results anyway (I don't want too many martial-arts musicals in my recommendations ;-)). I have, however, tremendous respect for the Indian movie industry, so I'd love to collaborate with an Indian cinephile!
[Image: Westerns — https://img11.hostingpics.net/pics/340226westerns.png]
Starting from there, I had multiple problem statements for both supervised and unsupervised machine learning:
Can I program a tailored recommendation system based on my own criteria?
What are the characteristics of the movies/directors I like the most?
What is the probability that I will like my next movie?
Can I find the data?
One of the objectives of sharing my work here is to find cinephile data scientists who might be interested and, hopefully, contribute or share insights :) Other interesting leads: use the tagline for NLP/clustering/genre guessing, leverage budget/revenue, link with other data sources using the normalized IMDb title, etc.
[Image: correlation matrix — https://img11.hostingpics.net/pics/977004matrice.png]
I graduated from a French engineering school, majoring in artificial intelligence, but that was 17 years ago, right in the middle of the A.I. winter. Like a lot of white male rocket scientists, I ended up in one of the leading European investment banks, quickly abandoning IT development to specialize in trading/risk project management and internal politics. My recent appointment to the Data Office made me aware of recent breakthroughs in data science, and I thought that developing a side project would be an excellent occasion to learn something new. Plus, it would give me the much-needed credibility that decision makers too often lack when it comes to data science.
I've worked on some of the features with Cédric Paternotte, a friend of mine who is a professor of philosophy of science at the Sorbonne. Working with someone from a different background seemed a good idea for motivation, creativity and rigor.
Kudos to the www.themoviedb.org and www.wikipedia.com sites, which really have a great attitude towards open data. This is typically NOT the case with modern big-data companies, which mostly keep data to themselves to try to monetize it. Such a huge contrast with the IMDb or Instagram APIs, which generously let you grab your last 3 comments at a miserable rate. Even if 15 years ago this seemed a mandatory path to get services for free, I predict that one day governments will need to break this data monopoly.
[Disclaimer: I apologize in advance for my English (I'm French ^-^), for any bad code I've written (there are probably hundreds of ways to do it better and faster), for any pseudo-scientific assumption I've made (I'm slowly getting back into statistics and lack senior guidance; one day I regress a non-stationary time series and the day after I discover I shouldn't have), and for any incorrect use of machine-learning models.]
[Image: powered by themoviedb.org — https://img11.hostingpics.net/pics/898068408x161poweredbyrectanglegreen.png]
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
(see https://tblock.github.io/10kGNAD/ for the original dataset page)
This page introduces the 10k German News Articles Dataset (10kGNAD), a German topic classification dataset. The 10kGNAD is based on the One Million Posts Corpus and is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. You can download the dataset here.
English text classification datasets are common. Examples are the big AG News, the class-rich 20 Newsgroups and the large-scale DBpedia ontology datasets for topic classification, and the commonly used IMDb and Yelp datasets for sentiment analysis. Non-English datasets, especially German ones, are less common. There is a collection of sentiment analysis datasets assembled by the Interest Group on German Sentiment Analysis. However, to my knowledge, no German topic classification dataset is available to the public.
Due to grammatical differences between English and German, a classifier might be effective on an English dataset but not as effective on a German one. German is more highly inflected, and long compound words are quite common compared to English. One would need to evaluate a classifier on multiple German datasets to get a sense of its effectiveness.
The 10kGNAD dataset is intended to solve part of this problem as the first German topic classification dataset. It consists of 10,273 German-language news articles from an Austrian online newspaper, categorized into nine topics. These articles are a previously unused part of the One Million Posts Corpus.
In the One Million Posts Corpus each article has a topic path, for example Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise. The 10kGNAD uses the second part of the topic path, here Wirtschaft, as the class label. As a result, the dataset can be used for multi-class classification.
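The label derivation just described is a one-liner in Python, shown here on the example path from the corpus:

```python
# The class label is the second segment of the article's topic path.
path = "Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise"
label = path.split("/")[1]
print(label)  # Wirtschaft
```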
I created and used this dataset in my thesis to train and evaluate four text classifiers on the German language. By publishing the dataset I hope to support the development of tools and models for German. Additionally, this dataset can serve as a benchmark for German topic classification.
As in most real-world datasets, the class distribution of the 10kGNAD is not balanced. The biggest class, Web, consists of 1,678 articles, while the smallest class, Kultur, contains only 539. However, articles from the Web class have on average the fewest words, while articles from the Kultur class have the second-most words.
I propose a stratified split of 10% for testing, with the remaining articles for training.
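A minimal pure-Python sketch of such a stratified split; the class names are borrowed from the dataset, but the sample texts and counts are placeholders:

```python
import random
from collections import defaultdict

def stratified_split(samples, test_frac=0.10, seed=42):
    """Split (text, label) pairs so each class keeps ~test_frac in test."""
    by_label = defaultdict(list)
    for item in samples:
        by_label[item[1]].append(item)
    rng = random.Random(seed)
    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        cut = max(1, round(len(items) * test_frac))  # per-class test size
        test.extend(items[:cut])
        train.extend(items[cut:])
    return train, test

# Placeholder articles with the dataset's largest/smallest class names.
data = [(f"article {i}", "Web") for i in range(20)] + \
       [(f"article {i}", "Kultur") for i in range(10)]
train, test = stratified_split(data)
print(len(train), len(test))  # 27 3
```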
To use the dataset as a benchmark, please use the train.csv and test.csv files located in the project root.
Python scripts to extract the articles and split them into a train set and a test set are available in the code directory of this project. Make sure to install the requirements. The original corpus.sqlite3 is required to extract the articles (download here (compressed) or here (uncompressed)).
This dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please consider citing the authors of the One Million Posts Corpus if you use the dataset.
Dataset Card for "HebrewMetaphors"
Dataset Summary
A common dataset for the text classification task is IMDb, the Large Movie Review Dataset, a dataset for binary sentiment classification. The first step in our project was to create a Hebrew dataset with an IMDb-like structure, differing in that, in addition to the sentences, it also contains verb names and a classification of whether the verb is literal or metaphorical in the given sentence. Using an… See the full description on the dataset page: https://huggingface.co/datasets/tdklab/HebrewMetaphors.