12 datasets found

Film Circulation dataset
zenodo.org
data.niaid.nih.gov
bin, csv, png
Updated Jul 12, 2024
Cite
Skadi Loist; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
Available download formats: csv, png, bin
Unique identifier
https://doi.org/10.5281/zenodo.7887672
Dataset updated
Jul 12, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Skadi Loist; Evgenia (Zhenya) Samoilova
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

Please cite this when using the dataset.

Detailed description of the dataset:

1 Film Dataset: Festival Programs

The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
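The relationship between the long and the wide table can be sketched in a few lines of Python. This is only an illustration: the column names (`film_id`, `fest`, `month`) and the three rows are hypothetical stand-ins, not the dataset's actual variables, which are documented in the codebook. The sketch also assumes that all appearances fall within the same year.

```python
# Hypothetical long-format rows: one row per film-festival appearance.
# Column names and values are illustrative assumptions, not the real codebook.
long_rows = [
    {"film_id": 1, "fest": "Berlinale", "month": 2},
    {"film_id": 1, "fest": "Frameline", "month": 6},
    {"film_id": 2, "fest": "Frameline", "month": 6},
]


def long_to_wide(rows):
    """Keep one row per film: its earliest sample festival appearance."""
    wide = {}
    # Sorting by month puts the earliest appearance of each film first
    # (assumes all rows belong to the same year).
    for row in sorted(rows, key=lambda r: r["month"]):
        # setdefault keeps the first row seen for each film ID.
        wide.setdefault(row["film_id"], row)
    return [wide[film_id] for film_id in sorted(wide)]


wide_rows = long_to_wide(long_rows)
for row in wide_rows:
    print(row["film_id"], row["fest"])
```

With these made-up rows, film 1 is attributed to Berlinale, mirroring the example in the description above.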

2 Survey Dataset

The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

3 IMDb & Scripts

The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

The dataset includes eight text files containing the scripts for web scraping. They were written using R version 3.6.3 for Windows.

The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach based on two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may have typos or minor variations.
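The two title-matching measures named above can be illustrated with a minimal Python sketch. It is not the authors' R code: the q-gram size of 2 and the example titles are assumptions, and the description does not state which thresholds were applied to the scores.

```python
from collections import Counter
from math import sqrt


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and transpositions of adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "en" <-> "ne".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def cosine_similarity(a: str, b: str, q: int = 2) -> float:
    """Cosine similarity between character q-gram count vectors
    (q=2 is an assumption made for this sketch)."""
    grams_a = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    grams_b = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    dot = sum(count * grams_b[gram] for gram, count in grams_a.items())
    norm_a = sqrt(sum(v * v for v in grams_a.values()))
    norm_b = sqrt(sum(v * v for v in grams_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# A swapped pair of letters costs a single OSA edit.
print(osa_distance("memento", "memneto"))
print(cosine_similarity("the dark knight", "dark knight"))
```

A small OSA distance tolerates typos such as swapped letters, while the cosine score rewards titles that share most of their character sequences even when words are added or dropped.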

The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and flags them for a manual check.

The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. It does so for the first 100 films only, to check that everything works. Because scraping the entire dataset took a few hours, a test with a subsample of 100 films is advisable.

The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.

4 Festival Library Dataset

The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,
Cinando film festival programming dataset
figshare.com
txt
Updated May 17, 2024
Cite
Vejune Zemaityte; Andres Karjus; Ulrike Rohn; Maximilian Schich; Indrek Ibrus (2024). Cinando film festival programming dataset [Dataset]. http://doi.org/10.6084/m9.figshare.22682794.v2
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.22682794.v2
Dataset updated
May 17, 2024
Dataset provided by
figshare
Authors
Vejune Zemaityte; Andres Karjus; Ulrike Rohn; Maximilian Schich; Indrek Ibrus
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data set supports the analyses presented in the paper titled Quantifying the global film festival circuit: Networks, diversity, and public value creation, published in PLoS ONE: https://doi.org/10.1371/journal.pone.0297404. The R code to reproduce the analyses is available at: https://github.com/andreskarjus/cinandofestivals. The data sample is sufficient for the reproduction of results and graphs. The graphs in this paper are also available as an interactive dashboard in an online supplementary, where details behind individual data points can be easily observed: https://andreskarjus.github.io/cinandofestivals.

This research has been made possible by data provided directly to authors from the Cannes Film Market (Marché du Film – Festival de Cannes), the company operating the Cinando website and database (https://cinando.com/). Launched in 2003 as the database of the attendees at the Cannes Festival, Cinando has since grown into the premier platform supporting hundreds of film festivals and film markets (industry events held during festivals, mostly oriented to promoting investment opportunities, rights sales, and production services). Cinando offers film professionals tools to navigate the film industry, including information about contacts, films, projects in development, market screening schedules, market attendees, and screeners. The platform services film festivals and markets by facilitating rights sales, investments, and business-to-business video on demand. The platform relies on a large proprietary relational database. The authors received a full database dump via a research partnership with the data owner on 2021-10-08.

This data set concerns the part of the Cinando database that registers information about film festival programming and contains, at face value, 77,398 films programmed at 38,367 festival events, resulting in 183,865 film–festival event pairs, between 2007–2022 (festivals_films.csv). The festival metadata includes event and, occasionally, festival series title, event location country, and event year. Film metadata contains runtime, production year, names of crew members (filmrole.csv - only crew role and gender are made available to protect privacy), origin countries (filmcountry.csv), languages spoken in the film (filmlang.csv), and content type tags (filmgen.csv). The latter is a mixture of tags typically used to describe films within the festival context, including genre (e.g. drama, documentary), target audience (children’s, family), identity (Jewish, LGBT), and production type (TV Series, VR).

The authors have cleaned and homogenized the data to make it usable and expanded it with festival type and crew gender information. Cinando technical IDs and personal data have been anonymized. The data owner supports the publication of this data set.
Global box office revenue 2014-2021, by format
statista.com
Updated Dec 7, 2023
Cite
Statista (2023). Global box office revenue 2014-2021, by format [Dataset]. https://www.statista.com/statistics/259987/global-box-office-revenue/
Dataset updated
Dec 7, 2023
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
Worldwide
Description
In 2021, the global box office revenue added up to approximately 21.3 billion U.S. dollars, up from 11.8 billion dollars a year earlier, an annual increase of 80.5 percent. Still, the 2021 result amounted to only slightly more than half of the 42.3-billion-dollar box office revenue recorded in 2019, before the COVID-19 outbreak. Furthermore, the share of 3D films in the global revenue went from six percent in 2020 to 6.6 percent in 2021.
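The percentages quoted above follow directly from the three revenue figures. A short Python check, using only the numbers given in the description (the function name is just a label chosen for this sketch):

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100


# Revenue figures in billion U.S. dollars, as quoted in the description.
revenue_2019, revenue_2020, revenue_2021 = 42.3, 11.8, 21.3

growth_2021 = percent_change(revenue_2020, revenue_2021)
share_of_2019 = revenue_2021 / revenue_2019 * 100

print(round(growth_2021, 1))    # 80.5
print(round(share_of_2019, 1))  # 50.4
```

The second value shows that the 2021 revenue was about 50.4 percent of the 2019 level.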

Cinema market: a challenging comeback

The pandemic changed the film industry by emptying movie theaters and accelerating the increase in video streaming penetration. In the so-called North American movie market – which consists of Canada and the United States (including the unincorporated territories of Guam and Puerto Rico) – the box office revenue more than doubled between 2020 and 2021. But the latter figure amounted to less than 40 percent of the pre-COVID-19 result. Meanwhile, subscription video-on-demand (SVoD) platforms went even further. Netflix kept the top spot while new competitors such as Disney+ diversified the offering.

Big players on the big screen

The global cinema segment spans way beyond North America, though. China alone sold more than 1.1 billion movie tickets throughout 2021, making it the leading market worldwide, right above the U.S. India ranked third with almost 380 million tickets sold that same year. With a vast film culture that extends well beyond its iconic Bollywood industry, India's cinema features a myriad of languages and offers advertising opportunities aimed at its very large audience.
Movie Rating Sites Report
marketreportanalytics.com
doc, pdf, ppt
Updated Apr 10, 2025
Cite
Market Report Analytics (2025). Movie Rating Sites Report [Dataset]. https://www.marketreportanalytics.com/reports/movie-rating-sites-75765
Available download formats: doc, ppt, pdf
Dataset updated
Apr 10, 2025
Dataset authored and provided by
Market Report Analytics
License
https://www.marketreportanalytics.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The global movie rating sites market is experiencing robust growth, driven by the increasing consumption of online streaming services and the rising demand for credible film reviews before purchasing tickets or subscribing. The market's expansion is fueled by several factors, including the proliferation of smartphones and internet access, making it easier for users to access rating platforms. Furthermore, the integration of social media features on many platforms fosters engagement and user-generated content, creating a dynamic and interactive ecosystem. The market is segmented by application (movie promotion, movie research, audience choice, and others) and by rating type (user-based, professional-based, and others). While precise market sizing data is unavailable, given the significant presence of established players like Rotten Tomatoes and IMDb, and considering the considerable global viewership of movies, we can estimate the 2025 market size to be approximately $2 billion. This estimation accounts for advertising revenue, premium subscriptions (where applicable), and potential data licensing to film studios and distributors. The projected CAGR suggests continued substantial growth throughout the forecast period (2025-2033), likely driven by technological advancements and the ever-growing global movie-watching audience. However, potential restraints include the risk of biased reviews and the increasing competition from new platforms and emerging technologies like AI-powered recommendation systems. The North American market currently holds a significant share due to the established presence of major players and a large movie-going audience. However, rapid growth is anticipated in the Asia-Pacific region, particularly in countries like India and China, fueled by the expansion of streaming platforms and increasing internet penetration. Europe, with its diverse film culture and established digital infrastructure, also represents a substantial market segment. 
Competitive pressures are intensifying, with existing players continually innovating to enhance user experiences, introduce new features, and attract and retain users in a crowded market. The market's future trajectory will be shaped by the strategic moves of key players, technological disruptions, and evolving consumer preferences regarding how they discover and choose movies to watch. Strategic partnerships and acquisitions could also play a significant role in shaping the market landscape in the coming years.
Upcoming 2020 Hollywood Movies
kaggle.com
Updated Jan 21, 2023
Cite
The Devastator (2023). Upcoming 2020 Hollywood Movies [Dataset]. https://www.kaggle.com/datasets/thedevastator/upcoming-2020-hollywood-movies/data
Dataset updated
Jan 21, 2023
Dataset provided by
Kaggle
Authors
The Devastator
Area covered
Hollywood
Description
Upcoming 2020 Hollywood Movies

Calendar, Production Companies, Cast and Crew

By Priyanka Dobhal [source]

About this dataset

This dataset provides a comprehensive list of the top upcoming Hollywood movies of 2021. With detailed information about each movie, including titles, production companies, cast and crew members, and sources for further reference, viewers can stay up to date on what's playing in theaters throughout the year. Discover beloved classics and modern-day blockbusters that will transport viewers to new worlds and stories for hours of entertainment!

How to use the dataset

To use this dataset, follow these steps:

1. Read through the columns of the dataset, such as “Month”, “Day”, and “Title”, to become familiar with the data.
2. Decide which feature of Hollywood movies you would like to explore, such as movies by certain actors, directors, or producers, or release dates.
3. Select the columns you need and manipulate them according to your requirements before the analysis, so that you can focus on the columns relevant to the feature chosen in step 2 (e.g. drop the casting director column if it is not related).
4. Analyse the relevant rows carefully, since different rows can provide important clues about the selected feature (e.g. the month column tells us when a movie is scheduled to be released).
5. Once the analysis is done, use visuals such as charts and graphs to bring out relationships between categorical and numerical variables.
6. Finally, validate the results of the steps above to make sure that the collected information relates directly to your problem statement and gives reliable insights into the chosen feature.

Research Ideas

Creating a timeline view of the up-coming Hollywood movie releases and their associated cast, crew and production company data.

Using production company data to analyze what genres, actors, and directors are popular this year.

Utilizing the cast and crew data to display the most experienced actor or filmmaker within each movie

Acknowledgements

If you use this dataset in your research, please credit the original author, Priyanka Dobhal.

License

See the dataset description for more information.

Columns

File: WIki_Movies.csv

| Column name | Description |
|:-----------------------|:-------------------------------------------------------------------------------|
| Month | The month in which the movie is scheduled to be released. (String) |
| Day | The day of the month in which the movie is scheduled to be released. (Integer) |
| Title | The title of the movie. (String) |
| Production company | The production company responsible for the movie. (String) |
| Cast and crew | The names of the cast and crew involved in the movie. (String) |
| Ref | The source from which the data was collected. (String) |

File: Hollywood Movies - 2020.csv

| Column name | Description |
|:-----------------------|:-----------------------------------------------------------------------|
| Title | The title of the movie. (String) |
| Production company | The production company responsible for the movie. (String) |
| Cast and crew | The names of the cast and crew involved in the movie. (String) |
| Opening | The date the movie is scheduled to be released in the US. (Date) |
| Opening2 | The date the movie is scheduled to be released internationally. (Date) |
| Ref. | The source from which the data was collected. (String) |

Harvard CS50 AI Degrees
kaggle.com
Updated Mar 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Adil Shamim (2025). Harvard CS50 AI Degrees [Dataset]. https://www.kaggle.com/datasets/adilshamim8/cs50-ai
Dataset updated
Mar 25, 2025
Dataset provided by
Kaggle
Authors
Adil Shamim
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This dataset, inspired by the Harvard University CS50 AI Degrees project, is designed to help explore the fascinating “Six Degrees of Kevin Bacon” problem. The dataset contains detailed information linking actors through the movies they have starred in, allowing users to build and analyze connections in a film industry network.

Key Components:

People Data:
Contains records for each actor with a unique identifier, name, and birth year. This information is essential for mapping actors and resolving ambiguities when multiple actors share the same name.

Movies Data:
Provides a list of movies, each with a unique identifier, title, and release year. This data supports the connection points between actors.

Stars Data:
Acts as the relationship table, connecting actors to the movies in which they have appeared. Each entry links a person (actor) to a movie, forming the backbone of the network used to calculate the shortest path between any two actors.

Project Background:

This dataset underpins the classic problem of finding the shortest connection between two actors—similar to the “Six Degrees of Kevin Bacon” game. By framing this as a graph search problem, where actors are nodes and movies serve as the edges connecting them, users can implement algorithms like breadth-first search to determine the minimum steps required to connect any two individuals within the dataset.
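The graph search described above can be sketched with a breadth-first search in Python. This is a minimal illustration, not the CS50 reference solution: the `stars` pairs and the single-letter person names are invented, whereas the real files identify people and movies by the unique identifiers described under Key Components.

```python
from collections import deque

# Hypothetical stars table: (person, movie) pairs with invented names.
stars = [
    ("A", "m1"), ("B", "m1"),
    ("B", "m2"), ("C", "m2"),
    ("D", "m3"),
]


def build_neighbors(stars):
    """Map each person to the set of people they share a movie with."""
    cast = {}
    for person, movie in stars:
        cast.setdefault(movie, set()).add(person)
    neighbors = {}
    for people in cast.values():
        for person in people:
            neighbors.setdefault(person, set()).update(people - {person})
    return neighbors


def degrees_of_separation(stars, source, target):
    """Minimum number of shared-movie links between two people, or None."""
    neighbors = build_neighbors(stars)
    if source not in neighbors or target not in neighbors:
        return None
    queue = deque([(source, 0)])
    visited = {source}
    while queue:
        person, steps = queue.popleft()
        if person == target:
            return steps
        for other in neighbors[person]:
            if other not in visited:
                visited.add(other)
                queue.append((other, steps + 1))
    return None  # the two people are not connected


print(degrees_of_separation(stars, "A", "C"))  # 2
```

Because breadth-first search explores people in order of increasing distance from the source, the first time it reaches the target it has found a shortest path.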

Use Cases:

Graph and Network Analysis:
Build and visualize networks to understand the collaborative dynamics in the film industry.

Algorithm Design:
Implement and test search algorithms (such as breadth-first search) on a real-world dataset.

Data Exploration:
Explore connections between actors and movies, and discover interesting patterns or "hidden" relationships in the industry.

Whether you're looking to practice algorithm design, explore social networks, or simply dive into a classic problem with a modern dataset, this resource provides a robust platform for both learning and experimentation.
Global Media and Internet Concentration Project – Canada – Dataset 2022
borealisdata.ca
search.dataone.org
Updated Feb 3, 2024
Cite
Dwayne Winseck (2024). Global Media and Internet Concentration Project – Canada – Dataset 2022 [Dataset]. http://doi.org/10.5683/SP3/BGQTDG
Unique identifier
https://doi.org/10.5683/SP3/BGQTDG
Dataset updated
Feb 3, 2024
Dataset provided by
Borealis
Authors
Dwayne Winseck
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Canada
Description
The Canadian contribution and data set prepared as part of the Global Media and Internet Concentration (GMIC) project offers an independent academic, empirical and data-driven analysis of a deceptively simple yet profoundly important question: have telecom, media and internet markets become more concentrated over time, or less? Media Ownership and Concentration is presented from more than a dozen sectors of the telecom-media-internet industries, including film, music and book industries. Note (22/01/2024): Small editorial changes were made throughout the report to clean up and improve the text. Small revisions to the estimates of the internet advertising revenue for some Canadian firms were also made to reflect newly available data. Those revisions were small and have no consequences for the analysis. Figures 1, 23, 25, 37, 40 and 41 were revised to reflect these changes.
Christopher Nolan Filmography and Scripts
kaggle.com
Updated Oct 5, 2020
Cite
Blesson Densil (2020). Christopher Nolan Filmography and Scripts [Dataset]. https://www.kaggle.com/blessondensil294/christopher-nolan-filmography-and-scripts/code
Dataset updated
Oct 5, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Blesson Densil
Description
Context

Christopher Edward Nolan CBE is a British-American filmmaker known for making personal, distinctive films within the Hollywood mainstream. His directorial efforts have grossed more than US$4.9 billion in theatres worldwide and garnered 34 Oscar nominations and ten wins.

Born and raised in London, Nolan developed an interest in filmmaking from a young age. After studying English literature at University College London, he made his feature debut with Following (1998). Nolan gained international recognition with his second film, Memento (2000), for which he was nominated for the Academy Award for Best Original Screenplay. He transitioned from independent to studio filmmaking with Insomnia (2002), and found further critical and commercial success with The Dark Knight Trilogy (2005–2012), The Prestige (2006), and Inception (2010), which received eight Oscar nominations, including for Best Picture and Best Original Screenplay. This was followed by Interstellar (2014) and Dunkirk (2017), the latter of which earned him Academy Award nominations for Best Picture and Best Director. His eleventh feature, Tenet, was released in 2020.

Nolan's films are typically rooted in epistemological and metaphysical themes, exploring human morality, the construction of time, and the malleable nature of memory and personal identity. His work is permeated by mathematically inspired images and concepts, unconventional narrative structures, practical special effects, experimental soundscapes, large-format film photography, and materialistic perspectives. He has co-written several of his films with his brother Jonathan, and runs the production company Syncopy Inc. with his wife Emma Thomas.

Nolan has received many awards and honours. Time named him one of the 100 most influential people in the world in 2015, and in 2019, he was appointed Commander of the Order of the British Empire for his services to film. - Wikipedia

Content

This dataset contains the work of Christopher Nolan, including his short films and movies from 1989 to date. It gives details for each film, including budget, Oscars, and his role in the movie. The dataset also contains the scripts of the movies he has directed or produced, for analysis.
350 000+ movies from themoviedb.org
kaggle.com
zip
Updated Oct 12, 2017
Cite
Stephanerappeneau (2017). 350 000+ movies from themoviedb.org [Dataset]. https://www.kaggle.com/stephanerappeneau/350-000-movies-from-themoviedborg
Available download formats: zip (70483259 bytes)
Dataset updated
Oct 12, 2017
Authors
Stephanerappeneau
Description
Context

I love movies.

I tend to avoid marvel-transformers-standardized products, and prefer a mix of classic hollywood-golden-age and obscure polish artsy movies. Throw in an occasional japanese-zombie-slasher-giallo as an alibi. Good movies don't exist without bad movies.

On average I watch 200+ movies each year, with peaks at more than 500 movies. Nine years ago I started to log my movies to avoid watching the same movie twice, and also assign scores. Over the years, it gave me a couple insights on my viewing habits but nothing more than what a tenth-grader would learn at school.

I've recently suscribed to Netflix and it pains me to see the global inefficiency of recommendation systems for people like me, who mostly swear by "La politique des auteurs". It's a term coined by famous new-wave french movie critic André Bazin, meaning that the quality of a movie is essentially linked to the director and it's capacity to execute his vision with his crew. We could debate it depends on movie production pipeline, but let's not for now. Practically, what it means, is that I essentially watch movies from directors who made films I've liked.

I suspect Netflix calibrates its recommendation models taking into account the way the "average-joe" chooses a movie. A few months ago I read a study based on a survey, showing that people chose a movie mostly based on genre (55%), then by leading actors (45%). Director and release date were far behind, at around 10% each. It is not surprising, since most people I know don't care who the director is. Lots of US blockbusters don't even mention it on the movie poster. I am aware that collaborative filtering is based on user proximity, which I believe decreases (or even eliminates) the need to characterize a movie. So here I'm more interested in content-based filtering, which is based on product proximity, for several reasons:

Users' tastes are not easily accessible. They are, after all, Netflix's treasure chest

The movie offer on Netflix is so bad for someone who likes auteur movies that it wouldn't help

Modeling a movie's intrinsic qualities is a nice challenge

Enough.

"*The secret of getting ahead is getting started*" (Mark Twain)

[Image: network graph]

Content

The primary source is www.themoviedb.org. If you watch obscure artsy romanian homemade movies you may find only 95% of your movies referenced...but for anyone else it should be in the 98%+ range.

movies details are from www.themoviedb.org API : movies/details

movies crew & casting are from www.themoviedb.org API : movies/credits

both can be joined by id

they contain all 350k movies, from the end of the 19th century up to August 2017. If you remove short movies from imdb you get a similar number of movies.
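
As a rough illustration of the join by id mentioned above, here is a minimal Python sketch using only the standard library. The column names (`id`, `title`, `director`) and the inline sample rows are assumptions made up for the example; the actual files in the dataset may use different names and many more columns.

```python
import csv
import io

# Hypothetical miniature stand-ins for the details and credits tables.
DETAILS_CSV = """id,title,release_date
1,Example Film A,1999-01-01
2,Example Film B,2005-06-15
3,Example Film C,2010-09-30
"""

CREDITS_CSV = """id,director
1,Director One
2,Director Two
"""


def join_by_id(details_text, credits_text):
    """Left-join details rows with credits rows on the shared 'id' column."""
    credits = {row["id"]: row for row in csv.DictReader(io.StringIO(credits_text))}
    joined = []
    for row in csv.DictReader(io.StringIO(details_text)):
        match = credits.get(row["id"], {})
        merged = dict(row)
        merged["director"] = match.get("director")  # None when no credits row exists
        joined.append(merged)
    return joined


movies = join_by_id(DETAILS_CSV, CREDITS_CSV)
for movie in movies:
    print(movie["id"], movie["title"], movie["director"])
```

A left join is used here so that movies without a credits row are kept rather than silently dropped; whether that is the right choice depends on the analysis.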

I uploaded the program to retrieve incremental movie details on github : https://github.com/stephanerappeneau/scienceofmovies/tree/master/PycharmProjects/GetAllMovies (need a dev API key from themoviedb.org though)

I have tried various supervised (decision tree) / unsupervised (clustering, NLP) approaches described in the discussions, source code is on github : https://github.com/stephanerappeneau/scienceofmovies

As a bonus I've uploaded the bio summary from top 500 critically-acclaimed directors from wikipedia, for some interesting NLTK analysis

Here is overview of the available sources that I've tried :

• Imdb.com free csv dumps (ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/temporaryaccess/) are badly documented, incomplete, loosely structured and impossible to join/merge. There's an API hosted by Amazon Web Service: 1€ every 100 000 requests. With around 1 million movies, it could become expensive, and the features are bare. So I've searched for other sources.

• www.themoviedb.org is based on crowdsourcing and has an excellent API, limited to 40 requests every 10 seconds. It is quite generous, well documented, and enough to sweep the 450 000 movies in a few days. For my purpose, data quality is not significantly worse than imdb, and as imdb key is also included there's always the possibility to complete my dataset later (I actually did it)

• www.Boxofficemojo.com has some interesting budget/revenue figures (which are sorely lacking in both imdb & tmdb), but it actually tracks only a few thousand movies, mainly blockbusters. There are other professional sources that are used by film industry to get better predictive / marketing insights but that's beyond my reach for this experiment.

• www.wikipedia.com is an interesting source with no real cap on API calls. However, it requires a bit of web scraping, and for movies or directors the layout and quality vary a lot, so I suspected it would take a lot of work to get insights and gave this source lower priority.

• www.google.com will ban you after a few minutes of web scraping because their job is to scrape data from others, then sell it, duh.

• It's worth mentioning that there are a few dumps of Netflix anonymized user tastes on kaggle, because they've organised a few competitions to improve their recommendation models. https://www.kaggle.com/netflix-inc/netflix-prize-data

• Online databases are largely white anglo-saxon centric, meaning the bollywood offer (India is the 2nd biggest producer of movies) is mostly absent from datasets. I'm fine with that, as it's not my cup of tea plus I lack domain knowledge. The sheer number of indian movies would probably skew my results anyway (I don't want to have too many martial-arts-musicals in my recommendations ;-)). I have, however, tremendous respect for the indian movie industry so I'd love to collaborate with an indian cinephile !

[Image: Westerns]
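
To make the rate limit quoted above concrete (40 requests every 10 seconds), here is a minimal Python sketch of a sliding-window limiter, plus the arithmetic behind "a few days". It is not the code from the linked repository. The class name, the injectable `clock` and `sleep` parameters, and the assumption of two requests per movie (one for details, one for credits) are all illustrative.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any window of `period` seconds."""

    def __init__(self, max_calls=40, period=10.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # timestamps of calls still inside the window

    def wait(self):
        """Block until another call is allowed, then record it."""
        while True:
            now = self.clock()
            # Forget calls that have left the window.
            while self.calls and now - self.calls[0] >= self.period:
                self.calls.popleft()
            if len(self.calls) < self.max_calls:
                self.calls.append(now)
                return
            # Window is full: sleep until the oldest call expires.
            self.sleep(self.period - (now - self.calls[0]))


def estimated_days(n_movies, requests_per_movie=2, max_calls=40, period=10.0):
    """Lower bound on the sweep duration in days at the maximum allowed rate."""
    seconds = n_movies * requests_per_movie * period / max_calls
    return seconds / 86400


# 450 000 movies x 2 requests at 4 requests/second is 225 000 seconds.
print(round(estimated_days(450_000), 1))  # prints 2.6
```

In use, one would call `limiter.wait()` immediately before each API request. The estimate ignores network latency and retries, so a real sweep would take somewhat longer.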

Inspiration

Starting from there, I had multiple problem statements for both supervised / unsupervised machine learning

Can I program a tailored-recommendation system based on my own criteria ?

What are the characteristics of movies/directors I like the most ?

What is the probability that I will like my next movie ?

Can I find the data ?

One of the objectives of sharing my work here is to find cinephile data-scientists who might be interested and, hopefully, contribute or share insights :) Other interesting leads : use tagline for NLP/Clustering/Genre guessing, leverage on budget/revenue, link with other data sources using the imdb normalized title, etc.

[Image: Correlation matrix]

Motivation, Disclaimer and Acknowledgements

I graduated from a french engineering school, majoring in artificial intelligence, but that was 17 years ago, right in the middle of the A.I. winter. Like a lot of white male rocket scientists, I ended up in one of the leading european investment banks, quickly abandoning IT development to specialize in trading/risk project management and internal politics. My recent appointment in the Data Office made me aware of recent breakthroughs in data science, and I thought that developing a side project would be an excellent occasion to learn something new. Plus it'd give me some much-needed credibility, which decision makers too often lack when it comes to data science.

I've worked on some of the features with Cédric Paternotte, a friend of mine who is a professor of philosophy of sciences in La Sorbonne. Working with someone with a different background seemed a good idea for motivation, creativity and rigor.

Kudos to the www.themoviedb.org and www.wikipedia.com sites, which really have a great attitude towards open data. This is typically NOT the case for modern big-data companies, which mostly keep data to themselves to try to monetize it. Such a huge contrast with the imdb or instagram APIs, which generously let you grab your last 3 comments at a miserable rate. Even if 15 years ago this seemed a mandatory path to get services for free, I predict one day governments will need to break this data monopoly.

[Disclaimer : I apologize in advance for my engrish (I'm french ^-^), any bad-code I've written (there are probably hundreds of way to do it better and faster), any pseudo-scientific assumption I've made, I'm slowly getting back in statistics and lack senior guidance, one day I regress a non-stationary time series and the day after I'll discover I shouldn't have, and any incorrect use of machine-learning models]

[Image: powered by themoviedb.org]
UNESCO Culture Statistics, 1995-2017
datacatalogue.cessda.eu
Updated Nov 29, 2024
UNESCO Institute for Statistics (2024). UNESCO Culture Statistics, 1995-2017 [Dataset]. http://doi.org/10.5255/UKDA-SN-8505-2
Unique identifier
https://doi.org/10.5255/UKDA-SN-8505-2
Dataset updated
Nov 29, 2024
Dataset authored and provided by
UNESCO Institute for Statistics (http://uis.unesco.org/)
Time period covered
Jan 1, 1995 - Dec 31, 2017
Area covered
United Kingdom
Description
Abstract copyright UK Data Service and data collection copyright owner.

The UNESCO Culture dataset contains three tables:
The UNESCO Cultural Employment table includes comparable data to monitor the contribution of culture to economic and social development as well as the conditions of those engaged in cultural activities.
The UNESCO Institute for Statistics (UIS) Questionnaire on Feature Film Statistics yields data including all key countries in the film industry. The data provide a unique perspective on how different countries and regions are transforming traditional approaches to the art and industry of film-making, especially in video and digital formats. Key indicators focus on habits of film consumption by looking at the origin of films viewed, as well as the most popular films, based on the frequency of attendance. Other indicators focus on indoor cinemas per capita and average ticket price per capita, providing a good perspective on cinema infrastructure and access.
The UNESCO International Trade in Cultural Goods table is built using data from the UN Comtrade Database, which itself comprises detailed global trade data.
The Oscar Award, 1927 - 2025
kaggle.com
Updated Mar 9, 2025
Raphael Fontes (2025). The Oscar Award, 1927 - 2025 [Dataset]. https://www.kaggle.com/unanimad/the-oscar-award/discussion
Dataset updated
Mar 9, 2025
Dataset provided by
Kaggle
Authors
Raphael Fontes
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Please, If you enjoyed this dataset, don't forget to upvote it.

Context

The Academy Awards, also officially and popularly known as the Oscars, are awards for artistic and technical merit in the film industry. Given annually by the Academy of Motion Picture Arts and Sciences (AMPAS), the awards are an international recognition of excellence in cinematic achievements as assessed by the Academy's voting membership. The various category winners are awarded a copy of a golden statuette, officially called the "Academy Award of Merit", although more commonly referred to by its nickname "Oscar". The statuette depicts a knight rendered in Art Deco style.

Content

This file contains a scrape of The Academy Awards Database, which records past Academy Award winners and nominees between 1927 and 2025.

the_oscar_award.csv contains a view of the data consistent with past views of this Kaggle dataset.

full_data.csv contains the full data, with additional columns and parsing, imported from github

Acknowledgements

The awards data was scraped from the Official Academy Awards search site; nominees were listed with their name first and film following in some categories, such as Best Actor/Actress, and in the reverse for others.

Inspiration

Do the Academy Awards reflect the diversity of American films or are the #OscarsSoWhite?

Which actor/actress has received the most awards overall or in a single year?

Which film has received the most awards in a ceremony?

Which country received the most awards at a ceremony and overall?

Can you predict who will receive the awards next year?

Thank you @lopcio for the amazing help in fixing this dataset's missing values through the updates.
Data from: Personalized Recommendation Systems Dataset
kaggle.com
Updated Dec 23, 2024
Alfaris Bachmid (2024). Personalized Recommendation Systems Dataset [Dataset]. https://www.kaggle.com/datasets/alfarisbachmid/personalized-recommendation-systems-dataset
Dataset updated
Dec 23, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Alfaris Bachmid
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Personalized Recommendation Systems Dataset (150,000 Entries)

This dataset is a fictional representation of user interactions within an e-commerce or streaming platform, created specifically for educational and training purposes. It simulates realistic user behavior and interactions to aid in developing and testing machine learning models for personalized recommendation systems. With 150,000 entries, it offers a rich variety of features suitable for building and evaluating algorithms in recommendation systems, user behavior analysis, and predictive modeling.

Dataset Features:
1. User_ID: A unique identifier for each user (e.g., User_1, User_2, etc.), representing individual profiles on the platform.
2. Item_ID: A unique identifier for each item, such as a product, movie, or song.
3. Category: The type of item interacted with (e.g., Electronics, Books, Music, Movies, etc.), providing insights into user preferences.
4. Rating: User-assigned ratings on a scale of 1.0 to 5.0, reflecting the level of satisfaction with the item.
5. Timestamp: The exact date and time of the interaction, useful for time-based analysis.
6. Price: The price of the item at the time of interaction, recorded in USD.
7. Platform: The platform or device used to interact with the system (e.g., Web, Mobile App, Smart TV, Tablet), capturing multi-device behavior.
8. Location: The geographic region of the user, categorized into areas such as North America, Europe, Asia, etc., for regional behavioral analysis.

Applications: This dataset is versatile and can be used for:
- Collaborative Filtering Models: Harness user-item interaction data to recommend items based on similar users or items.
- Content-Based Recommendation Systems: Leverage item attributes to generate personalized recommendations.
- User Behavior Analysis: Uncover insights into user preferences, habits, and trends to inform marketing strategies.
- Predictive Modeling: Train machine learning models to predict user preferences or future interactions.

Important Note: This dataset is fictional and does not represent real-world data. It has been generated solely for educational and training purposes, making it ideal for students, researchers, and data scientists who want to practice building machine learning models without using sensitive or proprietary data.

Why Use This Dataset?
1. Diverse and Realistic Features: Simulates key aspects of user interaction in modern platforms.
2. Scalable Size: Provides sufficient data for training advanced machine learning models, ensuring robust validation.
3. Rich Metadata: Enables detailed analysis and multiple use cases, from recommendation systems to business analytics.

This dataset is a great resource for exploring personalized recommendations or enhancing machine learning skills in a practical and safe manner.

Skadi Loist; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672

Film Circulation dataset

3 scholarly articles cite this dataset.

Available download formats: csv, png, bin

Unique identifier

https://doi.org/10.5281/zenodo.7887672

Dataset updated

Jul 12, 2024

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Skadi Loist; Evgenia (Zhenya) Samoilova

License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically

Description

Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

Please cite this when using the dataset.

Detailed description of the dataset:

1 Film Dataset: Festival Programs

The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
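
The relationship between the long and wide tables can be sketched in a few lines of Python. This is only an illustration of the "first sample festival" rule described above, not the authors' procedure. The field names (`film_id`, `title`, `fest`, `fest_date`) and the sample rows are hypothetical, and the real codebook may name and encode these variables differently.

```python
# Hypothetical long-format rows: one row per (film, sample festival) pair.
long_rows = [
    {"film_id": "F001", "title": "Example Film", "fest": "Frameline", "fest_date": "2017-06-15"},
    {"film_id": "F001", "title": "Example Film", "fest": "Berlinale", "fest_date": "2017-02-10"},
    {"film_id": "F002", "title": "Another Film", "fest": "Berlinale", "fest_date": "2017-02-12"},
]


def long_to_wide(rows):
    """Keep one row per film: the row of the earliest sample festival."""
    wide = {}
    # ISO dates sort chronologically as plain strings.
    for row in sorted(rows, key=lambda r: r["fest_date"]):
        wide.setdefault(row["film_id"], dict(row))  # first (earliest) row wins
    return list(wide.values())


def overlap_share(rows):
    """Share of unique films that appear in more than one sample festival."""
    counts = {}
    for row in rows:
        counts[row["film_id"]] = counts.get(row["film_id"], 0) + 1
    return sum(1 for n in counts.values() if n > 1) / len(counts)


wide_rows = long_to_wide(long_rows)
for row in wide_rows:
    print(row["film_id"], row["fest"])
print(overlap_share(long_rows))
```

In this toy input the film shown at both Berlinale and Frameline is listed under Berlinale in the wide table, mirroring the example in the description.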

2 Survey Dataset

The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

3 IMDb & Scripts

The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, potentially using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches each film in the core dataset with suggested films on the IMDb search page and records matching scores. Matching was done using data on directors, production year (+/- one year), and title. Titles were compared with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
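
The two title measures named above can be sketched in Python for readers who do not use R. This is an illustration, not a port of the authors' script. The use of character bigrams for the cosine measure, the thresholds `min_cosine` and `max_osa`, and the helper `is_candidate` are all assumptions, and the sketch compares only title and year, leaving out the director comparison.

```python
from collections import Counter
from math import sqrt


def qgrams(text, q=2):
    """Count the overlapping character q-grams of a string."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def cosine_similarity(a, b, q=2):
    """Cosine similarity between the q-gram count vectors of two strings."""
    pa, pb = qgrams(a, q), qgrams(b, q)
    dot = sum(pa[g] * pb[g] for g in pa)
    norm = sqrt(sum(v * v for v in pa.values())) * sqrt(sum(v * v for v in pb.values()))
    return dot / norm if norm else 0.0


def osa_distance(a, b):
    """Optimal string alignment: insert, delete, substitute, adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Adjacent transposition, e.g. "ia" vs "ai".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def is_candidate(core, imdb, year_tolerance=1, min_cosine=0.8, max_osa=2):
    """Year within tolerance and titles close by either measure (thresholds assumed)."""
    if abs(core["year"] - imdb["year"]) > year_tolerance:
        return False
    t1, t2 = core["title"].lower(), imdb["title"].lower()
    return cosine_similarity(t1, t2) >= min_cosine or osa_distance(t1, t2) <= max_osa


print(osa_distance("the piano", "the paino"))
print(is_candidate({"title": "The Piano", "year": 1993}, {"title": "The Paino", "year": 1994}))
```

A swapped pair of letters counts as a single edit under OSA, which is why it suits typo-like variations, while the q-gram cosine measure is less sensitive to where in the title a difference occurs.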

The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was assigned to one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and flags them for a manual check.

The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that for the first 100 films only, to check that everything works. Scraping for the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.

4 Festival Library Dataset

The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,
