2 datasets found

TMDB 5000 Movie Dataset
kaggle.com
zip
Updated Feb 8, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sarika (2025). TMDB 5000 Movie Dataset [Dataset]. https://www.kaggle.com/datasets/sarikaa9/tmdb-5000-movie-dataset
Explore at:
zip(1659098 bytes)Available download formats
Dataset updated
Feb 8, 2025
Authors
Sarika
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
What can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over $100 million to produce can still flop, this question is more important than ever to the industry. Film aficionados might have different interests. Can we predict which films will be highly rated, whether or not they are a commercial success?

This is a great place to start digging in to those questions, with data on the plot, cast, crew, budget, and revenues of several thousand films.
TMDB 5000 Movie Dataset
kaggle.com
zip
Updated Sep 28, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Movie Database (TMDb) (2017). TMDB 5000 Movie Dataset [Dataset]. https://www.kaggle.com/tmdb/tmdb-movie-metadata
Explore at:
zip(9317430 bytes)Available download formats
Dataset updated
Sep 28, 2017
Dataset provided by
The Movie Databasehttp://themoviedb.org/
Authors
The Movie Database (TMDb)
Description
Background

What can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over $100 million to produce can still flop, this question is more important than ever to the industry. Film aficionados might have different interests. Can we predict which films will be highly rated, whether or not they are a commercial success?

This is a great place to start digging in to those questions, with data on the plot, cast, crew, budget, and revenues of several thousand films.

Data Source Transfer Summary

We (Kaggle) have removed the original version of this dataset per a DMCA takedown request from IMDB. In order to minimize the impact, we're replacing it with a similar set of films and data fields from The Movie Database (TMDb) in accordance with their terms of use. The bad news is that kernels built on the old dataset will most likely no longer work.

The good news is that:

You can port your existing kernels over with a bit of editing. This kernel offers functions and examples for doing so. You can also find a general introduction to the new format here.

The new dataset contains full credits for both the cast and the crew, rather than just the first three actors.

Actor and actresses are now listed in the order they appear in the credits. It's unclear what ordering the original dataset used; for the movies I spot checked it didn't line up with either the credits order or IMDB's stars order.

The revenues appear to be more current. For example, IMDB's figures for Avatar seem to be from 2010 and understate the film's global revenues by over $2 billion.

Some of the movies that we weren't able to port over (a couple of hundred) were just bad entries. For example, this IMDB entry has basically no accurate information at all. It lists Star Wars Episode VII as a documentary.

Data Source Transfer Details

Several of the new columns contain json. You can save a bit of time by porting the load data functions from this kernel.

Even in simple fields like runtime may not be consistent across versions. For example, previous dataset shows the duration for Avatar's extended cut while TMDB shows the time for the original version.

There's now a separate file containing the full credits for both the cast and crew.

All fields are filled out by users so don't expect them to agree on keywords, genres, ratings, or the like.

Your existing kernels will continue to render normally until they are re-run.

If you are curious about how this dataset was prepared, the code to access TMDb's API is posted here.

New columns:

homepage

id

original_title

overview

popularity

production_companies

production_countries

release_date

spoken_languages

status

tagline

vote_average

Lost columns:

actor_1_facebook_likes

actor_2_facebook_likes

actor_3_facebook_likes

aspect_ratio

cast_total_facebook_likes

color

content_rating

director_facebook_likes

facenumber_in_poster

movie_facebook_likes

movie_imdb_link

num_critic_for_reviews

num_user_for_reviews

Open Questions About the Data

There are some things we haven't had a chance to confirm about the new dataset. If you have any insights, please let us know in the forums!

Are the budgets and revenues all in US dollars? Do they consistently show the global revenues?

This dataset hasn't yet gone through a data quality analysis. Can you find any obvious corrections? For example, in the IMDb version it was necessary to treat values of zero in the budget field as missing. Similar findings would be very helpful to your fellow Kagglers! (It's probably a good idea to keep treating zeros as missing, with the caveat that missing budgets much more likely to have been from small budget films in the first place).

Inspiration

Can you categorize the films by type, such as animated or not? We don't have explicit labels for this, but it should be possible to build them from the crew's job titles.

How sharp is the divide between major film studios and the independents? Do those two groups fall naturally out of a clustering analysis or is something more complicated going on?

Acknowledgements

This dataset was generated from The Movie Database API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additiona...
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Sarika (2025). TMDB 5000 Movie Dataset [Dataset]. https://www.kaggle.com/datasets/sarikaa9/tmdb-5000-movie-dataset

TMDB 5000 Movie Dataset

Metadata on ~5,000 movies from TMDb

Explore at:

zip(1659098 bytes)Available download formats

Dataset updated

Feb 8, 2025

Authors

Sarika

License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

What can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over $100 million to produce can still flop, this question is more important than ever to the industry. Film aficionados might have different interests. Can we predict which films will be highly rated, whether or not they are a commercial success?

This is a great place to start digging in to those questions, with data on the plot, cast, crew, budget, and revenues of several thousand films.

Clear search

Close search

Google apps

Main menu

TMDB 5000 Movie Dataset

TMDB 5000 Movie Dataset

Background

Data Source Transfer Summary

Data Source Transfer Details

Open Questions About the Data

Inspiration

Acknowledgements

TMDB 5000 Movie Dataset

Metadata on ~5,000 movies from TMDb