https://crawlfeeds.com/privacy_policy
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
To use this dataset:
import tensorflow_datasets as tfds
# Load the labeled training split of the IMDB reviews dataset.
ds = tfds.load('imdb_reviews', split='train')
# Print the first four examples.
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides detailed information on IMDb movies and television shows, integrating descriptions sourced from Rotten Tomatoes. It contains data for approximately 7800 titles, primarily from the 1990s onwards, and has been filtered to include English language content with specific criteria for ratings and votes. The purpose of this dataset is to facilitate projects involving cross-content analysis, content-based recommendation systems, and genre prediction tasks. It offers a rich resource for understanding entertainment media consumption and developing machine learning applications.
The dataset comprises approximately 7800 individual movie and TV show records. It is typically provided in a CSV file format. The data has been curated, filtering the original IMDb dataset to focus on content from the 1990s through to 2023. Only titles in English ('en') have been retained, and specific rating and vote thresholds have been applied, such as movies/shows from the 90s-00s with ratings of 7.9 or higher, and those from the 2000s onwards with ratings of 6.5 or higher. Titles from Canada, Great Britain, India, and the USA are represented.
This dataset is highly suitable for various analytical and machine learning tasks, including:
* Developing content-based recommendation systems using genres, descriptions, and ratings.
* Performing exploratory data analysis on movie and TV show trends.
* Implementing Natural Language Processing (NLP) techniques on title descriptions for insights or feature extraction.
* Executing multi-label classification to predict genres from description data.
* Clustering movies and shows based on their descriptions and genre attributes.
* Aiding projects that require cross-content analysis across different media types.
The dataset primarily covers movies and TV shows released from 1990 to 2023. Geographically, the data includes titles relevant to Canada, Great Britain, India, and the USA. There is no specific demographic scope mentioned beyond the inclusion of English-language titles. The dataset has specific filtering criteria for data availability based on rating scores and the number of votes, ensuring a focus on well-received or highly-engaged content.
CC0
This dataset is ideal for:
* Data Scientists and Analysts: For conducting exploratory data analysis, building predictive models, and deriving insights into media consumption.
* Machine Learning Engineers: For developing and training recommendation engines, NLP models, and classification algorithms.
* Researchers: Studying trends in film and television, cross-media analysis, and content categorisation.
* Developers: Creating applications that require rich movie and TV show data, such as content discovery platforms.
* Academics and Students: For educational purposes, coursework, and research projects in data science, AI, and media studies.
Original Data Source: IMDb Movies/Shows with Descriptions
https://creativecommons.org/publicdomain/zero/1.0/
By imdb (from Hugging Face)
The IMDb Large Movie Review Dataset is a comprehensive collection of movie reviews used for sentiment classification. The dataset includes a wide range of movie reviews along with their corresponding sentiment labels, which indicate whether the review is positive or negative in nature. This invaluable dataset is aimed at facilitating sentiment analysis and classification tasks in the field of natural language processing.
The main purpose of the train.csv file within this dataset is to provide a curated collection of movie reviews, each accompanied by its respective sentiment label. This file proves particularly useful for training machine learning models to accurately predict sentiment and classify reviews based on their emotional tone.
Similarly, the test.csv file contains another set of movie reviews along with corresponding sentiment labels. Meant for testing and validating the performance of trained models, this dataset enables researchers and developers to evaluate their models' effectiveness in real-world scenarios.
Additionally, the unsupervised.csv file offers an alternative subset within the dataset. Unlike train.csv and test.csv, unsupervised.csv does not include any associated sentiment labels for individual movie reviews. This specific subset serves as a valuable resource for exploring unsupervised learning techniques within the domain of sentiment classification.
By utilizing this meticulously compiled IMDb Large Movie Review Dataset, researchers and data scientists can delve into various aspects related to analyzing sentiments in textual data. With its carefully labeled data points covering both positive and negative sentiments expressed in diverse film critiques, this dataset empowers users to develop sophisticated machine learning algorithms that accurately assess subjective opinions from text data.
Introduction:
Dataset Overview:
- Train.csv: This file contains a set of movie reviews along with their sentiment labels. It is intended for training your sentiment analysis models.
- Test.csv: This file provides another set of movie reviews along with their corresponding sentiment labels. You can use this file to evaluate the performance of your trained models.
- Unsupervised.csv: This file includes movie reviews without any associated sentiment labels. It can be used for unsupervised sentiment classification tasks.
Columns in the Dataset:
- text: The main column containing the text of each movie review.
- label: The sentiment label assigned to each review, indicating whether it is positive or negative.
Guidelines for Using the Dataset:
Training Your Model:
- Begin by loading and preprocessing the data from train.csv
- Treat 'text' as your input feature and 'label' as your target variable
- Explore different machine learning or deep learning algorithms suitable for text classification
- Train your model using various techniques, such as bag-of-words, word embeddings, or transformers
- Evaluate and fine-tune your model's performance using test.csv
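As one concrete instance of the bag-of-words route listed above, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. It assumes rows shaped like train.csv, with a `text` and a `label` column; the 0 = negative / 1 = positive encoding, the helper names, and the four toy reviews are illustrative assumptions, not taken from the dataset.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a review and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(rows):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of reviews
    vocab = set()
    for text, label in rows:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior for one review."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids log(0) for unseen pairs.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy stand-in for train.csv rows (assumed encoding: 1 = positive, 0 = negative).
train_rows = [
    ("A wonderful, moving film with great acting", 1),
    ("Great story and wonderful direction", 1),
    ("Terrible plot and awful acting", 0),
    ("An awful, boring waste of time", 0),
]
model = train_naive_bayes(train_rows)
prediction = predict(model, "wonderful acting and a great story")
```

With the real files, the rows would come from reading train.csv (for example with `csv.DictReader`) instead of the inline list; word embeddings or transformers would replace this model entirely rather than extend it.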
Evaluating Your Model:
- Load test.csv and preprocess the data similar to what you did with train.csv
- Use this preprocessed test data to evaluate the accuracy, precision, recall, F1 score or other relevant metrics of your trained model on unseen data
- Analyze these metrics to understand how well your model is performing in predicting sentiments
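The metrics named above can be computed directly from the true and predicted labels. The sketch below assumes binary labels with 1 as the positive class (an assumption about the encoding); the function name and the example label lists are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: ground truth from test.csv vs. model predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

Because the labeled splits are balanced between positive and negative reviews, accuracy is a reasonable headline number here; precision and recall become more informative if you subsample unevenly.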
Advancing Your Model (Unsupervised Classification):
- Utilize unsupervised.csv for unsupervised sentiment classification tasks
- Preprocess the movie reviews in this file and explore techniques like clustering, topic modeling, or self-supervised learning
- Extract patterns, themes, or sentiments from the reviews without any guidance from labeled data
Conclusion:
- Sentiment Analysis: This dataset can be used to train models for sentiment analysis, where the goal is to predict whether a movie review is positive or negative based on its text.
- NLP Research: The dataset can be used for various natural language processing (NLP) tasks such as text classification, information extraction, or named entity recognition. Researchers and practitioners can leverage this dataset to develop and evaluate new algorithms and techniques in the field of NLP.
- Recommendation Systems: The sentiment labels in this dataset can be used as a source of feedback or user preferences for recommendation systems. By analyzing the sentiments expressed in reviews,...
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Introduction
TMDB.org is a crowd-sourced movie information database used by many film-related consoles, sites and apps, such as XBMC, MythTV and Plex. Dozens of media managers, mobile apps and social sites make use of its API. TMDb lists some 80,000 films at time of writing, which is considerably fewer than IMDb. While not as complete as IMDb, it holds extensive information for most popular/Hollywood films. This dataset of the 10,000 most popular movies across the world was fetched through the read API. TMDB's free API lets developers and their teams programmatically fetch and use TMDb's data. The API is free to use as long as you attribute TMDb as the source of the data and/or images. They also update their API from time to time.
This dataset was fetched with an exception-handling process, so it contains some null values where fields are missing in the TMDb database. Still, it's good practice for a young analyst to deal with missing values. Hey analyst, are you all excited?
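Since the file contains null values, a first step is usually to count them per column. The following is a minimal sketch using only the standard library; the column names (`title`, `overview`, `release_date`), the sample rows, and the set of strings treated as null are assumptions for illustration, not the dataset's actual schema.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the real CSV file.
sample_csv = """title,overview,release_date
Movie A,A heist goes wrong.,2010-07-16
Movie B,,2014-11-05
Movie C,Two friends travel.,
Movie D,,
"""


def count_missing(rows, null_tokens=("", "null", "None", "NaN")):
    """Count, per column, how many rows hold an empty or null-like value."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value.strip() in null_tokens:
                missing[column] += 1
    return missing


rows = list(csv.DictReader(io.StringIO(sample_csv)))
missing = count_missing(rows)

# Keep only rows where every field is filled in.
complete_rows = [r for r in rows if all(v and v.strip() for v in r.values())]
```

Whether to drop incomplete rows or fill them in depends on the analysis; dropping is the simplest option but discards otherwise usable movies.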
Original Data Source: Popular Movies of IMDb
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains movie plots extracted from Wikipedia, along with other key metadata. It is specifically curated for movies released between 1950 and 2023 that have accumulated over 1000 ratings on IMDb. The primary purpose of this dataset is to facilitate development in Large Language Models (LLMs) for applications such as movie searching or recommendation systems. The plot summaries have been meticulously cleaned to remove irrelevant elements like links and references, ensuring a pure text value. Where Wikipedia plots were unavailable, IMDb synopses were used as a fallback. The dataset includes 89% of movies with detailed plot information, while 100% include a short summary untouched from Wikipedia, which is useful for matching metadata in retriever applications. Columns like 'stars', 'directors', and 'genres' are provided as lists of values, making them suitable for direct loading into vector databases.
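When list-valued columns such as 'stars', 'directors', and 'genres' are stored in a CSV, each cell has to be parsed back into a list before it can be loaded elsewhere. The sketch below assumes the lists are serialized as Python-style list literals (for example `['Drama', 'History']`); that serialization, the sample row, and the helper name are assumptions, so check the actual file first.

```python
import ast
import csv
import io

# Hypothetical sample row; the real file may serialize lists differently.
sample_csv = '''title,stars,directors,genres
Example Film,"['Actor One', 'Actor Two']","['Director One']","['Drama', 'History']"
'''

LIST_COLUMNS = ("stars", "directors", "genres")


def parse_list_cell(cell):
    """Parse a serialized list cell into a Python list of values."""
    if not cell:
        return []
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return [cell]  # plain string: treat it as a single-item list
    return list(value) if isinstance(value, (list, tuple)) else [value]


records = []
for row in csv.DictReader(io.StringIO(sample_csv)):
    for column in LIST_COLUMNS:
        row[column] = parse_list_cell(row[column])
    records.append(row)
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for cells read from an untrusted file.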
The data file is typically in CSV format. The dataset spans movies released from 1950 up to 2023. There are 20,617 unique movie titles, 21,596 unique star names, and 9,863 unique director names. The genres column contains 21,675 unique values. Movie runtimes range from -1 to 776 minutes, with a significant majority (17,433 entries) falling between 76.70 and 115.55 minutes. The number of ratings (ratingCount) varies widely, starting from 1,001 and going up to 2.73 million. IMDb ratings range from 1.2 to 9.3. While specific total row/record counts are not available, the distribution data for year, runtime, ratingCount, and imdb_rating show various value counts within different ranges.
This dataset is ideal for:
* Developing demonstration projects leveraging Large Language Models (LLMs).
* Creating movie search applications, such as cinemattr.ca.
* Building retriever applications where the 'summary' column can be used for metadata matching.
* Populating vector databases with structured information from 'stars', 'directors', and 'genres' for advanced querying and analysis.
The dataset's geographic scope is global. It includes movies released within the time frame of 1950 to 2023. The data availability specifies that 89% of the movies have detailed plot information, and all movies (100%) include a short summary. The dataset focuses on films with more than 1000 ratings on IMDb.
CC0
This dataset is suitable for:
* AI and machine learning developers who are building models based on natural language processing.
* Data scientists and researchers interested in film data and entertainment analytics.
* Software engineers developing applications that require movie plot summaries or metadata, such as recommendation engines.
* Students and enthusiasts looking for high-quality, pre-processed text data for LLM projects.
Original Data Source: Movie Plots from Wikipedia
https://www.datainsightsmarket.com/privacy-policy
The relational in-memory database (IMDB) market is experiencing robust growth, driven by the increasing demand for real-time analytics and applications requiring ultra-low latency data processing. The market, estimated at $15 billion in 2025, is projected to expand at a Compound Annual Growth Rate (CAGR) of 18% between 2025 and 2033, reaching approximately $60 billion by 2033. This growth is fueled by several key factors. Firstly, the rise of big data and the need for faster insights across various sectors like finance, healthcare, and telecommunications are propelling adoption. Secondly, advancements in technology, such as improved memory capacity and processing power, are making IMDBs more affordable and efficient. Finally, cloud computing platforms are playing a significant role, offering scalable and cost-effective IMDB solutions. Major players like Microsoft, IBM, Oracle, and Amazon are investing heavily in this space, leading to increased competition and innovation. While the market faces challenges such as data security concerns and the complexity of integrating IMDBs into existing systems, these are likely to be mitigated by continuous technological advancements and increasing industry expertise. Despite the overall positive outlook, market segmentation reveals distinct growth patterns. The financial services sector is currently the largest adopter of IMDB technology, followed by the telecommunications and healthcare industries. Geographic distribution shows that North America and Europe currently hold the largest market shares, but significant growth is anticipated in Asia-Pacific regions due to increasing digitalization and data generation. Challenges remain in ensuring data consistency and managing the potential cost overhead associated with high-memory requirements. 
However, the continuous development of efficient memory management techniques and the integration of IMDBs with advanced analytics tools are likely to address these concerns and further fuel market expansion. The long-term outlook for the relational in-memory database market remains exceptionally promising, suggesting consistent high-growth potential well into the next decade.
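The projection quoted above follows from compound growth. As a quick arithmetic check, the sketch below compounds the $15 billion 2025 estimate at 18% per year; it assumes eight annual compounding periods (2025 to 2033), and the function names are illustrative.

```python
def project_value(start_value, annual_rate, years):
    """Compound a starting value at a fixed annual growth rate."""
    return start_value * (1 + annual_rate) ** years


def implied_cagr(start_value, end_value, years):
    """Return the annual growth rate that links two values over `years`."""
    return (end_value / start_value) ** (1 / years) - 1


years = 2033 - 2025  # assumed: eight compounding periods

# $15 billion growing at 18% per year.
projected_2033 = project_value(15.0, 0.18, years)

# Growth rate that would land exactly on $60 billion in 2033.
rate_for_60 = implied_cagr(15.0, 60.0, years)
```

Under these assumptions the projection comes to roughly $56 billion, in the neighbourhood of the stated "approximately $60 billion"; hitting $60 billion exactly would take a rate of about 18.9%.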
https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides a comprehensive collection of all titles (Movies and TV Series) available on Netflix. In addition to basic information, it includes IMDb-specific data like IMDb ID, Average Rating, and Number of Votes.
The dataset is updated daily at 10:00 AM CET. If you find this dataset helpful, feel free to give it an upvote! 😊
You can find all our APIs, maintained and developed by us, at the following link: octopusteam.dev. These APIs provide access to various features and data, ensuring high-quality and reliable integration options for your needs.
https://www.datainsightsmarket.com/privacy-policy
Market Overview: The global in-memory database (IMDB) market is poised for substantial growth, with a projected CAGR of 19.00% from 2025 to 2033. The market size, valued at XX million in 2025, is attributed to the increasing adoption of IMDBs in various industries, including telecommunications, BFSI, logistics, retail, entertainment, and healthcare. Key drivers behind this growth include the need for real-time data processing, improved performance, and the rise of big data and analytics.
Market Dynamics: The IMDB market is influenced by several trends and challenges. The growing adoption of cloud-based IMDB solutions is a key trend, as it provides flexibility and cost-effectiveness. However, security concerns and latency issues associated with cloud-based deployments pose challenges. Additionally, the increasing demand for high-performance computing and the need for faster data processing are driving the development of advanced IMDB technologies. The market is fragmented, with established players such as IBM, Oracle, and Microsoft competing alongside emerging startups like VoltDB and MemSQL. Regional variations in market maturity and adoption rates are also observed, with North America leading the way in terms of market penetration.
Recent developments include:
May 2022: IBM and SAP announced the extension of their collaboration as IBM embarks on a corporate transformation initiative to optimize its business operations using RISE and SAP S/4HANA Cloud. To execute work for over 1,000 legal entities in more than 120 countries and multiple IBM companies supporting hardware, software, consulting, and finance, IBM said it is transferring to SAP S/4HANA, SAP's most recent ERP system, as part of the extended relationship. The replacement for SAP R/3 and SAP ERP, SAP S/4HANA, is SAP's ERP system for large businesses. It is intended to work optimally with SAP's in-memory database, SAP HANA.
November 2022: Redis, a provider of real-time in-memory databases, and Amazon Web Services have announced a multi-year strategic alliance. Redis is a networked, open-source NoSQL system that stores data on disk for durability before moving it to DRAM as necessary. It can function as a streaming engine, message broker, database, or cache. The business claims that when Redis is used as a database, apps may instantly search across tens of millions of rows of customer data to locate information specific to one particular customer. A managed database-as-a-service product on AWS is called the real-time Redis Enterprise Cloud.
December 2022: The National Stock Exchange, the largest stock exchange in India, chose the Raima Database Manager (RDM) Workgroup 12.0 in-memory system as a foundational component for the next iterations of its trading platform front-end, the National Exchange for Automated Trading (NEAT).
Key drivers for this market are: Decreasing Hardware Cost, Increasing Penetration Of Trends Like Big Data And IOT; Increase In The Volume Of Data Generated And Shift Of Enterprise Operations.
Potential restraints include: Resilience In Integration With VLDB'S.
Notable trends are: Telecommunication End-User Industry to Hold Significant Market Share.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In show business, awards are conferred to persons and films to provide incentives to performers’ future career development through periodic film festivals and events. In this work, we focused on exploring the growth and dynamics of the film award system, the structure of the award network, and the relationships between historical performance, collaborations, and future career success of performers in the movie industry. We collected data from IMDb, which covers more than 3.5K movie events for 520K individuals with their award-winning and career records for over 90 years. By using network analysis and regression models, we find several novel results. First, we found the exponential proliferation of awards across all genres of films and all professions of individuals and the uneven distribution of the number of awards in careers across time. More than 30% of the performers have won multiple awards. Second, we built an award network to reveal the interlocks between awards based on multiple award-winning phenomena. We found that for prestigious awards, 47% of the linkages were over-represented relative to the expectations of the null model. Furthermore, the performers’ collaboration network was highly clustered, exhibiting a high propensity of linkages between awarded performers. Lastly, our regression models revealed that multiple factors were related to performers’ early career success and award winning. Specifically, we showed that along with the performers’ historical achievements, their collaborators serve an important role in award winning after being nominated, with the scope and depth of the impact differing in the awards’ prestige. This work has strong implications for the harmonious dynamics of the movie industry and the career development of performers.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
"Oppenheimer," directed by the legendary Christopher Nolan, is set to grace theaters on July 21, 2023. This cinematic masterpiece offers an enthralling journey into history, recounting the extraordinary life of J. Robert Oppenheimer, a pivotal figure in the development of the atomic bomb during World War II.
CC0
Original Data Source: Oppenheimer IMDb reviews
https://creativecommons.org/publicdomain/zero/1.0/
This folder contains the data behind the story 'Straight Outta Compton' Is The Rare Biopic Not About White Dudes.
biopics.csv contains the following variables:
| Variable | Definition |
|---|---|
| title | Title of the film. |
| site | URL from IMDB. |
| country | Country of origin. |
| year_released | Year of release. |
| box_office | Gross earnings at U.S. box office. |
| director | Director of film. |
| number_of_subjects | The number of subjects featured in the film. |
| subject | The actual name of the featured subject. |
| type_of_subject | The occupation of subject or reason for recognition. |
| race_known | Indicates whether the subject’s race was discernible based on background of self, parent, or grandparent. |
| subject_race | Race of the subject. |
| person_of_color | Dummy variable that indicates person of color. |
| subject_sex | Sex of subject. |
| lead_actor_actress | The actor or actress who played the subject. |
Source: IMDb.
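As an illustration of how the variables above might be used, the sketch below computes the share of biopic subjects flagged as a person of color. The column names come from the table; the sample rows and the 0/1 encoding of `person_of_color` are assumptions (the real file may encode the dummy variable differently), and the helper name is hypothetical.

```python
import csv
import io

# Hypothetical rows using a subset of the documented columns.
sample_csv = """title,year_released,subject,person_of_color,subject_sex
Film A,2001,Subject One,0,Male
Film B,2008,Subject Two,1,Female
Film C,2014,Subject Three,1,Male
Film D,2014,Subject Four,0,Male
"""


def share_person_of_color(rows):
    """Fraction of rows whose person_of_color dummy is set (assumed 0/1)."""
    flags = [int(row["person_of_color"]) for row in rows]
    return sum(flags) / len(flags) if flags else 0.0


rows = list(csv.DictReader(io.StringIO(sample_csv)))
overall_share = share_person_of_color(rows)

# The same measure restricted to one release year.
rows_2014 = [r for r in rows if r["year_released"] == "2014"]
share_2014 = share_person_of_color(rows_2014)
```

Each row describes one subject, so a film with several subjects appears more than once; group by `title` first if you want a per-film measure.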
This is a dataset from FiveThirtyEight hosted on their GitHub. Explore FiveThirtyEight data using Kaggle and all of the data sources available through the FiveThirtyEight organization page!
This dataset is maintained using GitHub's API and Kaggle's API.
This dataset is distributed under the Attribution 4.0 International (CC BY 4.0) license.
https://crawlfeeds.com/privacy_policy
Amazon Prime TV Shows and Movies Dataset offered by Crawl Feeds is an extensive resource containing over 92,000 records in JSON format. This dataset encompasses a wide array of data points, including links, titles, descriptions, release dates, genres, posters, streaming platforms, countries, number of seasons, content ratings, IMDb ratings, cast and crew details, unique identifiers, and scraping timestamps. Such comprehensive information is invaluable for researchers, data analysts, and developers aiming to conduct in-depth analyses, develop recommendation systems, or explore trends within Amazon Prime's content library.
For those interested in broader media datasets, Crawl Feeds also offers the Movies and TV Shows Dataset, which includes 118,000 records, and the IMDb Movie Details Dataset, comprising 250,000 records. These datasets provide extensive information across various platforms, facilitating comparative studies and cross-platform analyses.
Integrating these datasets into your projects can significantly enhance the depth and quality of your analyses, providing a robust foundation for exploring various facets of the entertainment industry. Whether you're developing a new application, conducting market research, or performing academic studies, these datasets serve as a valuable resource for gaining insights into the dynamic world of streaming media.
Explore the Amazon Prime TV Shows and Movies Dataset and other related datasets on Crawl Feeds to elevate your data-driven projects.
https://choosealicense.com/licenses/other/
Dataset Card for IMDB Kurdish
Dataset Summary
Central Kurdish translation of the famous IMDB movie reviews dataset. The dataset contains 50K highly polar movie reviews, divided into two equal classes of positive and negative reviews. We can perform binary sentiment classification using this dataset. The availability of datasets in Kurdish, such as the IMDB movie reviews dataset, can help researchers and developers train and evaluate machine learning models for Kurdish… See the full description on the dataset page: https://huggingface.co/datasets/razhan/imdb_ckb.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides IMDb user reviews for Christopher Nolan's highly anticipated film "**Oppenheimer**," which premiered on July 21, 2023. The film offers an engaging journey into history, recounting the extraordinary life of J. Robert Oppenheimer, a pivotal figure in the development of the atomic bomb during World War II. This collection of reviews allows for an insightful examination of public sentiment and audience reactions to this cinematic masterpiece.
The dataset is presented in a tabular format, comprising individual user reviews linked with their respective ratings. It contains 2445 entries or rows. The ratings span from 1.00 to 10.00, with a significant proportion of scores concentrated in the higher ranges. While specific file type details are not provided, data files of this nature are typically available in formats such as CSV.
This dataset is ideally suited for:
* Analysing audience sentiment and public opinion regarding the film "Oppenheimer."
* Performing Natural Language Processing (NLP) tasks on unstructured movie review text, such as topic modelling or entity extraction.
* Developing and evaluating sentiment analysis models to predict review polarity.
* Visualising movie ratings distribution and identifying trends in audience reception.
* Academic and market research into film criticism, audience engagement, and the public's response to historical dramas.
CC0
Original Data Source: Oppenheimer IMDb reviews
I love movies.
I tend to avoid Marvel-Transformers-standardized products, and prefer a mix of classic Hollywood-golden-age and obscure Polish artsy movies. Throw in an occasional Japanese-zombie-slasher-giallo as an alibi. Good movies don't exist without bad movies.
On average I watch 200+ movies each year, with peaks at more than 500 movies. Nine years ago I started to log my movies to avoid watching the same movie twice, and also to assign scores. Over the years, it gave me a couple of insights on my viewing habits but nothing more than what a tenth-grader would learn at school.
I've recently subscribed to Netflix and it pains me to see the global inefficiency of recommendation systems for people like me, who mostly swear by "La politique des auteurs". It's a term coined by famous new-wave French movie critic André Bazin, meaning that the quality of a movie is essentially linked to the director and his capacity to execute his vision with his crew. We could debate whether it depends on the movie production pipeline, but let's not for now. Practically, what it means is that I essentially watch movies from directors who made films I've liked.
I suspect Netflix calibrates its recommendation models on the way the "average Joe" chooses a movie. A few months ago I read a study based on a survey, showing that people choose a movie mostly based on genre (55%), then on leading actors (45%). Director and release date were far behind, at around 10% each. It is not surprising, since most people I know don't care who the director is. Lots of US blockbusters don't even mention it on the movie poster. I am aware that collaborative filtering is based on user proximity, which I believe decreases (or even eliminates) the need to characterize a movie. So here I'm more interested in content-based filtering, which is based on product proximity, for several reasons:
Users' tastes are not easily accessible. They are, after all, Netflix's treasure chest
The movie offering on Netflix is so bad for someone who likes auteur movies that it wouldn't help
Modeling a movie's intrinsic qualities is a nice challenge
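To make "product proximity" concrete, here is a minimal sketch of content-based filtering: each movie is described by a set of features (director, genres), and candidates are ranked by Jaccard similarity to a movie I liked. This is only an illustration under assumptions, not the approach actually used in this project; the catalog, feature names, and function names are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the overlap divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(liked_title, catalog, top_n=2):
    """Rank the other movies by feature overlap with one liked movie."""
    liked = catalog[liked_title]
    scored = [
        (jaccard(liked, features), title)
        for title, features in catalog.items()
        if title != liked_title
    ]
    # Highest similarity first; ties broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:top_n]]


# Hypothetical catalog: every movie is a set of content features.
catalog = {
    "Film A": {"director:Director X", "genre:Drama", "genre:Crime"},
    "Film B": {"director:Director X", "genre:Drama"},
    "Film C": {"director:Director Y", "genre:Comedy"},
    "Film D": {"director:Director Y", "genre:Crime", "genre:Drama"},
}
recommendations = recommend("Film A", catalog)
```

No other user's taste enters the computation, which is the point: the ranking depends only on how movies are characterized, so weighting the director feature more heavily would be one way to encode an auteur-driven preference.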
Enough.
"*The secret of getting ahead is getting started*" (Mark Twain)
[Image: network graph (https://img11.hostingpics.net/pics/117765networkgraph.png)]
The primary source is www.themoviedb.org. If you watch obscure artsy Romanian homemade movies you may find only 95% of your movies referenced... but for anyone else it should be in the 98%+ range.
Movie details are from the www.themoviedb.org API: movies/details
Movie crew & casting are from the www.themoviedb.org API: movies/credits
Both can be joined by id
They contain all 350k movies, from the end of the 19th century to August 2017. If you remove short movies from IMDb you get a similar number of movies.
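Joining the details and the credits by id can be sketched in plain Python as below. Only the `id` key comes from the description above; the other field names and the sample records are hypothetical.

```python
def join_by_id(details, credits_rows):
    """Inner-join movie details with credits on the shared `id` field."""
    credits_by_id = {row["id"]: row for row in credits_rows}
    joined = []
    for movie in details:
        match = credits_by_id.get(movie["id"])
        if match is None:
            continue  # inner join: skip movies that have no credits record
        # On a field name clash, the details record wins.
        joined.append({**match, **movie})
    return joined


# Hypothetical records shaped like the two files.
details = [
    {"id": 1, "title": "Film A"},
    {"id": 2, "title": "Film B"},
]
credits_rows = [
    {"id": 1, "director": "Director X"},
    {"id": 3, "director": "Director Z"},
]
movies = join_by_id(details, credits_rows)
```

Building the dictionary once keeps the join linear in the number of rows, which matters at the scale of a few hundred thousand movies.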
I uploaded the program to retrieve incremental movie details on GitHub: https://github.com/stephanerappeneau/scienceofmovies/tree/master/PycharmProjects/GetAllMovies (you need a dev API key from themoviedb.org though)
I have tried various supervised (decision tree) / unsupervised (clustering, NLP) approaches described in the discussions; the source code is on GitHub: https://github.com/stephanerappeneau/scienceofmovies
As a bonus I've uploaded the bio summaries of the top 500 critically-acclaimed directors from Wikipedia, for some interesting NLTK analysis
Here is an overview of the available sources that I've tried:
• Imdb.com free csv dumps (ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/temporaryaccess/) are badly documented, incomplete, loosely structured and impossible to join/merge. There's an API hosted by Amazon Web Services, priced at 1€ per 100 000 requests. With around 1 million movies it could become expensive, and its features are bare. So I've searched for other sources.
• www.themoviedb.org is based on crowdsourcing and has an excellent API, limited to 40 requests every 10 seconds. It is quite generous, well documented, and enough to sweep the 450 000 movies in a few days. For my purpose, data quality is not significantly worse than imdb, and as the imdb key is also included there's always the possibility to complete my dataset later (I actually did it)
• www.Boxofficemojo.com has some interesting budget/revenue figures (which are sorely lacking in both imdb & tmdb), but it actually tracks only a few thousand movies, mainly blockbusters. There are other professional sources that are used by the film industry to get better predictive / marketing insights, but they are beyond my reach for this experiment.
• www.wikipedia.com is an interesting source with no real cap on API calls. However, it requires a bit of web scraping, and for movies or directors the layout and quality vary a lot, so I suspected it would take a lot of work to get insights and I gave this source a lower priority.
• www.google.com will ban you after a few minutes of web scraping because their job is to scrape data from others, then sell it, duh.
• It's worth mentioning that there are a few dumps of Netflix anonymized user tastes on kaggle, because they've organised a few competitions to improve their recommendation models. https://www.kaggle.com/netflix-inc/netflix-prize-data
• Online databases are largely white anglo-saxon centric, meaning that Bollywood's output (India is the 2nd biggest producer of movies) is mostly absent from datasets. I'm fine with that, as it's not my cup of tea, plus I lack domain knowledge. The sheer number of Indian movies would probably skew my results anyway (I don't want to have too many martial-arts-musicals in my recommendations ;-)). I have, however, tremendous respect for the Indian movie industry, so I'd love to collaborate with an Indian cinephile!
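To put the themoviedb.org rate limit mentioned above in perspective, here is a minimal sketch of the arithmetic, plus a simple client-side throttle. It is not the code from the github repository: the function and class names are hypothetical, and it assumes one request per movie. Under that assumption the cap alone gives a lower bound of roughly 31 hours for 450 000 movies; needing several requests per movie would be consistent with the "few days" quoted above.

```python
import time


def sweep_duration_hours(n_items, max_requests=40, window_seconds=10.0):
    """Lower bound (in hours) to issue n_items requests under a rate cap.

    Assumes exactly one request per item (an assumption, not a fact
    about the original crawler).
    """
    requests_per_second = max_requests / window_seconds
    return n_items / requests_per_second / 3600.0


class Throttle:
    """Spaces calls at least window_seconds / max_requests apart."""

    def __init__(self, max_requests=40, window_seconds=10.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = window_seconds / max_requests
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next request is allowed, then reserve a slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


if __name__ == "__main__":
    print(f"{sweep_duration_hours(450_000):.2f} hours")
```

In a crawler loop, one would call `throttle.wait()` right before each HTTP request; the `clock` and `sleep` parameters exist only so the pacing can be checked without actually waiting.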
Image: Westerns (https://img11.hostingpics.net/pics/340226westerns.png)
Starting from there, I had multiple problem statements for both supervised and unsupervised machine learning:
Can I program a tailored recommendation system based on my own criteria?
What are the characteristics of the movies/directors I like the most?
What is the probability that I will like my next movie?
Can I find the data?
One of the objectives of sharing my work here is to find cinephile data scientists who might be interested and, hopefully, contribute or share insights :) Other interesting leads: use the tagline for NLP/clustering/genre guessing, leverage budget/revenue, link with other data sources using the imdb normalized title, etc.
Image: Correlation matrix (https://img11.hostingpics.net/pics/977004matrice.png)
I graduated from a French engineering school, majoring in artificial intelligence, but that was 17 years ago, right in the middle of the A.I. winter. Like a lot of white male rocket scientists, I ended up in one of the leading European investment banks, quickly abandoning IT development to specialize in trading/risk project management and internal politics. My recent appointment in the Data Office made me aware of recent breakthroughs in data science, and I thought that developing a side project would be an excellent occasion to learn something new. Plus it would give me some much-needed credibility, which decision makers too often lack when it comes to data science.
I've worked on some of the features with Cédric Paternotte, a friend of mine who is a professor of philosophy of science at La Sorbonne. Working with someone with a different background seemed a good idea for motivation, creativity and rigor.
Kudos to www.themoviedb.org and www.wikipedia.com, which really have a great attitude towards open data. This is typically NOT the case for modern big-data companies, which mostly keep data to themselves to try to monetize it. It is such a huge contrast with the imdb or instagram APIs, which "generously" let you grab your last 3 comments at a miserable rate. Even if 15 years ago this seemed a mandatory path to get services for free, I predict that one day governments will need to break this data monopoly.
[Disclaimer: I apologize in advance for my engrish (I'm French ^-^), for any bad code I've written (there are probably hundreds of ways to do it better and faster), for any pseudo-scientific assumptions I've made, and for any incorrect use of machine-learning models. I'm slowly getting back into statistics and lack senior guidance: one day I regress a non-stationary time series, and the day after I discover I shouldn't have.]
Image: powered by themoviedb.org (https://img11.hostingpics.net/pics/898068408x161poweredbyrectanglegreen.png)
https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
DESCRIPTION: ACTIV-ES is a comparable Spanish corpus comprising film dialogue from Argentine, Mexican and Spanish productions. Titles for each of these three countries were seeded from the Internet Movie Database. Subtitle data for the hearing impaired was provided by Opensubtitles.org; it was post-processed to correct or remove subtitle, OCR and diacritic artifacts, and annotated for part-of-speech.
The data is available in two main formats: 1) running text for each document and 2) 1- to 5-gram aggregate files. Each format includes a plain text and a part-of-speech annotated version. Document names reflect the language code, country, year, title, type, genre (first genre listed in the IMDb), and IMDb ID.
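As an illustration of what 1- to 5-gram aggregates are, the sketch below counts every n-gram of length 1 through 5 in a token sequence. It is only a minimal example and not the corpus's own tooling: the function name is hypothetical, the sample sentence is made up, and plain whitespace tokenization is an assumption (the corpus's actual tokenization and file layout are not described here).

```python
from collections import Counter


def ngram_counts(tokens, min_n=1, max_n=5):
    """Return {n: Counter of n-gram tuples} for n in min_n..max_n."""
    counts = {n: Counter() for n in range(min_n, max_n + 1)}
    for n, counter in counts.items():
        # Slide a window of width n over the token sequence.
        for i in range(len(tokens) - n + 1):
            counter[tuple(tokens[i:i + n])] += 1
    return counts


# Made-up sample line, split on whitespace (an assumption).
tokens = "no sé qué decir no sé".split()
counts = ngram_counts(tokens)

for n in sorted(counts):
    print(n, counts[n].most_common(2))
```

Aggregating over a whole corpus would amount to summing these counters across documents, for example with `Counter.update`.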
For more information about the development and evaluation of these resources and to cite this work refer to:
Francom, J., Hulden, M. and Ussishkin, A. (2014) ACTIV-ES: a comparable, cross-dialect corpus of 'everyday' Spanish from Argentina, Mexico, and Spain. In Proceedings of the Ninth Annual Language Resources and Evaluation Conference, Reykjavik, Iceland. European Language Resources Association (ELRA).
In version .02, a tagged running-text version of the corpus has been added in the /eagles directory, which uses the EAGLES tagset. This tagset is much more fleshed out than the simplified tagset in the /tagged directory. For information on the tagset refer here: http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html.
https://cdla.io/sharing-1-0/
This dataset contains sentences labelled with positive or negative sentiment, extracted from reviews of products, movies, and restaurants.
Each line has the format: sentence \t score
Score is either 1 (for positive) or 0 (for negative)
The source for these sentences is: yelp.com
This dataset is an extract of a dataset created for the paper 'From Group to Individual Labels using Deep Features', Kotzias et al., KDD 2015.
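A minimal sketch of reading this tab-separated format is shown below. The function name and the two sample lines are made up for illustration; the sketch assumes one example per line, with the score after the last tab.

```python
def parse_labelled_lines(lines):
    """Parse 'sentence<TAB>score' lines into (sentence, label) pairs."""
    examples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so a tab inside the sentence cannot break parsing.
        sentence, score = line.rsplit("\t", 1)
        label = int(score)
        if label not in (0, 1):
            raise ValueError(f"unexpected score {score!r}")
        examples.append((sentence, label))
    return examples


# Made-up sample lines in the documented format.
sample = [
    "The food was great.\t1\n",
    "Service was slow and the soup was cold.\t0\n",
]
examples = parse_labelled_lines(sample)
positives = sum(label for _, label in examples)
print(len(examples), positives)
```

For a real file, the same function can be fed an open file handle, e.g. `parse_labelled_lines(open(path, encoding="utf-8"))`, where `path` is whatever local filename the extract was saved under.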
https://crawlfeeds.com/privacy_policy