https://creativecommons.org/publicdomain/zero/1.0/
Title: 9,565 Top-Rated Movies Dataset
Description:
This dataset offers a comprehensive collection of 9,565 of the highest-rated movies according to audience ratings on the Movie Database (TMDb). The dataset includes detailed information about each movie, such as its title, overview, release date, popularity score, average vote, and vote count. It is designed to be a valuable resource for anyone interested in exploring trends in popular cinema, analyzing factors that contribute to a movie’s success, or building recommendation engines.
Key Features:
- Title: The official title of each movie.
- Overview: A brief synopsis or description of the movie's plot.
- Release Date: The release date of the movie, formatted as YYYY-MM-DD.
- Popularity: A score indicating the current popularity of the movie on TMDb, which can be used to gauge current interest.
- Vote Average: The average rating of the movie, based on user votes.
- Vote Count: The total number of votes the movie has received.
Data Source:
The data was sourced from the TMDb API, a well-regarded platform for movie information, using the /movie/top_rated endpoint. The dataset represents a snapshot of the highest-rated movies as of the time of data collection.
Data Collection Process:
- API Access: Data was retrieved programmatically using TMDb’s API.
- Pagination Handling: Multiple API requests were made to cover all pages of top-rated movies, ensuring the dataset’s comprehensiveness.
- Data Aggregation: Collected data was aggregated into a single, unified dataset using the pandas library.
- Cleaning: Basic data cleaning was performed to remove duplicates and handle missing or malformed data entries.
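The collection steps above can be sketched in a few lines of pandas. This is a minimal illustration, not the original collection script: the `fetch_page` helper is hypothetical and stands in for real requests to the TMDb /movie/top_rated endpoint (which require an API key).

```python
import pandas as pd

# Hypothetical page fetcher: in the real pipeline this would call the TMDb
# /movie/top_rated endpoint (e.g. via requests.get with an api_key).
def fetch_page(page):
    sample = {
        1: [{"id": 1, "title": "A", "vote_average": 8.7, "vote_count": 100},
            {"id": 2, "title": "B", "vote_average": 8.5, "vote_count": 90}],
        2: [{"id": 2, "title": "B", "vote_average": 8.5, "vote_count": 90},
            {"id": 3, "title": "C", "vote_average": None, "vote_count": 80}],
    }
    return sample.get(page, [])

# Pagination handling: walk pages until one comes back empty.
rows = []
page = 1
while True:
    results = fetch_page(page)
    if not results:
        break
    rows.extend(results)
    page += 1

# Aggregation and basic cleaning: one DataFrame, with duplicate ids and
# malformed rows (here, a missing rating) dropped.
df = (pd.DataFrame(rows)
        .drop_duplicates(subset="id")
        .dropna(subset=["vote_average"])
        .reset_index(drop=True))
```

The same drop-duplicates/drop-missing pattern scales unchanged to the full paginated crawl.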
Potential Uses:
- Trend Analysis: Analyze trends in movie ratings over time or compare ratings across different genres.
- Recommendation Systems: Build and train models to recommend movies based on user preferences.
- Sentiment Analysis: Perform text analysis on movie overviews to understand common themes and sentiments.
- Statistical Analysis: Explore the relationship between popularity, vote count, and average ratings.
Data Format: The dataset is provided in a structured tabular format (e.g., CSV), making it easy to load into data analysis tools like Python, R, or Excel.
Usage License: The dataset is shared under [appropriate license], ensuring that it can be used for educational, research, or commercial purposes, with proper attribution to the data source (TMDb).
https://creativecommons.org/publicdomain/zero/1.0/
Description: This dataset provides comprehensive movie statistics compiled from multiple sources, including Wikipedia, The Numbers, and IMDb. It offers a rich collection of information and insights into various aspects of movies, such as movie titles, production dates, genres, runtime minutes, director information, average ratings, number of votes, approval index, production budgets, domestic gross earnings, and worldwide gross earnings.
The dataset combines data scraped from Wikipedia, which includes details about movie titles, production dates, genres, runtime minutes, and director information, with data from The Numbers, a reliable source for box office statistics. Additionally, IMDb data is integrated to provide information on average ratings, number of votes, and other movie-related attributes.
With this dataset, users can analyze and explore trends in the film industry, assess the financial success of movies, identify popular genres, and investigate the relationship between average ratings and box office performance. Researchers, movie enthusiasts, and data analysts can leverage this dataset for various purposes, including data visualization, predictive modeling, and deeper understanding of the movie landscape.
Features:
- Movie_title
- Production_date
- Genres
- Runtime_minutes
- Director_name (primaryName)
- Director_professions (primaryProfession)
- Director_birthYear
- Director_deathYear
- Movie_averageRating: the average rating given by online users for a particular movie
- Movie_numberOfVotes: the number of votes given by online users for a particular movie
- Approval_Index: a normalized indicator (on a 0-10 scale) calculated by multiplying the logarithm of the number of votes by the average user rating. It provides a concise measure of a movie's overall popularity and approval among online viewers, penalizing both films that got too few reviews and blockbusters that got very many.
- Production_budget ($)
- Domestic_gross ($)
- Worldwide_gross ($)
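As a rough illustration of the Approval_Index idea described in the feature list, here is a hedged sketch. The reference vote count used for rescaling to 0-10 is an assumption; the dataset does not publish its exact normalization constant.

```python
import math

# Sketch of the Approval_Index: average user rating times the logarithm of
# the vote count, rescaled toward a 0-10 range. ref_votes (3,000,000) is an
# assumed rescaling constant, not the dataset's documented value.
def approval_index(avg_rating, n_votes, ref_votes=3_000_000):
    if n_votes < 1:
        return 0.0
    raw = avg_rating * math.log10(n_votes)   # unnormalized score
    scaled = raw / math.log10(ref_votes)     # ~0-10 for typical vote counts
    return min(10.0, scaled)
```

The log term is what penalizes thinly reviewed films: a 9.0-rated movie with 100 votes scores well below a 8.0-rated movie with 100,000 votes.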
Potential Applications:
- Box office analysis: Analyze the relationship between production budgets, domestic and worldwide gross earnings, and profitability.
- Genre analysis: Identify the most popular genres based on movie counts and analyze their performance.
- Rating analysis: Explore the relationship between average ratings, number of votes, and financial success.
- Director analysis: Investigate the impact of directors on movie ratings and financial performance.
- Time-based analysis: Study movie trends over different production years and observe changes in production budgets, box office earnings, and genre preferences.
By utilizing this dataset, users can gain valuable insights into the movie industry and uncover patterns that can inform decision-making, market research, and creative strategies.
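A minimal sketch of the box-office-analysis use case, using the column names from the feature list on a few hypothetical rows (the figures are illustrative, not from the dataset):

```python
import pandas as pd

# Toy rows using the dataset's documented column names; values are made up.
df = pd.DataFrame({
    "Movie_title": ["A", "B", "C"],
    "Genres": ["Drama", "Action", "Action"],
    "Production_budget": [10_000_000, 150_000_000, 60_000_000],
    "Worldwide_gross": [55_000_000, 450_000_000, 40_000_000],
})

# Profitability and return on investment per film.
df["Profit"] = df["Worldwide_gross"] - df["Production_budget"]
df["ROI"] = df["Profit"] / df["Production_budget"]

# Aggregate profit by genre, most profitable first.
by_genre = df.groupby("Genres")["Profit"].sum().sort_values(ascending=False)
```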
The IMDb movie review dataset consists of a balanced sample of 25,000 positive and 25,000 negative reviews, divided into equal-size train and test sets, with an average document length of 231 words.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains metadata for the top 10,000 most popular movies available on The Movie Database (TMDB). TMDB is a widely used online platform and community providing extensive details on films, TV shows, and related content. Users can browse and search for titles, accessing information such as cast, crew, synopses, and ratings. This dataset is designed for data analysts, researchers, and developers keen on examining movie popularity and attributes. It is a valuable resource for various analyses, including exploring trends in movie genres over time, identifying patterns in budget versus revenue, and evaluating the impact of different attributes on a film's popularity. The data was gathered from TMDB's public API and has undergone thorough cleaning and preprocessing to enhance its quality and usability.
This dataset comprises metadata for the top 10,000 most popular movies from The Movie Database. Specific numbers for rows or records beyond this top count are not available. The data has been meticulously crafted from raw information obtained via TMDB's public API and subsequently cleaned and preprocessed.
Ideal applications for this dataset include:
* Analysing trends in movie genres over time.
* Identifying correlations between movie budget, revenue, and popularity.
* Developing and testing movie recommendation systems.
* Exploring the impact of different attributes on a movie's success.
* Academic research into film industry dynamics and audience reception.
The dataset's geographic coverage is Global, reflecting the worldwide reach of movies and TMDB's user base. It focuses on the top 10,000 most popular movies, implying a snapshot of current or recent popularity without a specific historical time range for the films themselves. No specific demographic scope for the data is provided, but it reflects engagement from TMDB users generally.
CC0
This dataset is primarily intended for:
* Data Analysts: To scrutinise and analyse movie popularity and attributes.
* Researchers: For academic studies on film trends, audience behaviour, and industry patterns.
* Developers: To build and test applications such as movie recommendation engines or data visualisations.
Original Data Source: TMDB_top_rated_movies
This repository contains network graphs and network metadata from Moviegalaxies, a website providing network graph data for 773 films (1915–2012). The data includes individual network graph data in Graph Exchange XML Format and descriptive statistics on measures such as clustering coefficient, degree, density, diameter, modularity, average path length, the total number of edges, and the total number of nodes.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This is a dataset of the 10,000 most popular movies across the world, irrespective of language and recency. These have been extracted using the TMDb API.
What is TMDb's API? The closed-source API service is for people interested in using TMDb's movie, TV show, or actor images and/or data in their applications. TMDb's API is a system provided so that developers and their teams can programmatically fetch and use TMDb's data and/or images. The API is free to use as long as you attribute TMDb as the source of the data and/or images. TMDb also updates the API from time to time.
This dataset lists the 10,000 most popular movies across the globe. Information held inside the dataset:
A. Dataset 1: Movies dataset
1. title - Title of the movie in English.
2. overview - A small summary of the plot.
3. original_lang - Original language it was shot in.
4. rel_date - Date of release.
5. popularity - Popularity score.
6. vote_count - Votes received.
7. vote_average - Average of all votes received.
B. Dataset 2: Genres dataset
1. id
2. Movie ID
3. Genre
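Since genres live in a separate table keyed by movie ID, a typical first step is joining the two datasets. A sketch on hypothetical miniature tables; the exact join-key column names in the real files are assumptions here:

```python
import pandas as pd

# Miniature stand-ins for the two tables described above.
movies = pd.DataFrame({
    "id": [101, 102],
    "title": ["Movie X", "Movie Y"],
    "vote_average": [8.1, 7.4],
})
genres = pd.DataFrame({
    "movie_id": [101, 101, 102],
    "genre": ["Drama", "Thriller", "Comedy"],
})

# One row per (movie, genre) pair; a two-genre movie appears twice.
merged = movies.merge(genres, left_on="id", right_on="movie_id")
```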
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Movielens dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ayushimishra2809/movielens-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
GroupLens Research has collected and made available rating data sets from the MovieLens web site (http://movielens.org). The data sets were collected over various periods of time, depending on the size of the set. A recommender system is a simple algorithm whose aim is to provide the most relevant information to a user by discovering patterns in a dataset. The algorithm rates the items and shows the user the items that they would rate highly.
The data consists of 105,339 ratings applied over 10,329 movies. The average rating is 3.5, and the minimum and maximum ratings are 0.5 and 5 respectively. There are 668 users who have given their ratings for 149,532 movies.
Can you make a movie recommender system using any type of recommendation algorithm, such as content-based or collaborative filtering?
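As a starting point for that exercise, here is a minimal item-based collaborative-filtering sketch on toy ratings in the MovieLens (userId, movieId, rating) layout; the movie IDs and ratings below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy ratings table in the MovieLens layout; the real file has 105,339 rows.
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 10, 20, 30],
    "rating":  [5.0, 4.0, 1.0, 4.0, 5.0, 5.0, 4.5, 1.5],
})

# Item-based collaborative filtering: two movies are similar when the same
# users rate them similarly (cosine similarity over the user axis).
mat = ratings.pivot(index="movieId", columns="userId", values="rating").fillna(0)
norms = np.linalg.norm(mat.values, axis=1, keepdims=True)
sim = pd.DataFrame(mat.values @ mat.values.T / (norms @ norms.T),
                   index=mat.index, columns=mat.index)

def most_similar(movie_id):
    # Recommend the nearest neighbour, excluding the movie itself.
    return sim[movie_id].drop(movie_id).idxmax()
```

Content-based filtering would follow the same shape, but with the similarity computed over movie features (e.g. genres) instead of user co-ratings.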
--- Original source retains full ownership of the source dataset ---
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains details for 1262 Indonesian movies, compiled to offer insights into the country's film industry. It was assembled using an IMDb-Scraper and then converted and cleaned into a CSV file, providing a structured collection of movie information [1]. The data was collected from IMDb.com [1].
The dataset is provided in a CSV file format [1]. It includes 1262 unique movie records or rows [1, 2].
This dataset is ideal for: * Exploratory data analysis of Indonesian cinema trends [1]. * Natural Language Processing (NLP) tasks on movie descriptions [1]. * Analysing movie characteristics such as genre distribution, rating trends, and language prevalence. * Studying the impact of directors and actors within the Indonesian film landscape.
The dataset specifically covers Indonesian movies [1, 2]. The time range for these movies spans from 1926 to 2020 [2].
CC0
Original Data Source: IMDb Indonesian Movies
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Hollywood Theatrical Market Synopsis 1995 to 2021’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/johnharshith/hollywood-theatrical-market-synopsis-1995-to-2021 on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This Dataset contains the data of market analysis built on The Numbers unique categorization system, which uses 6 different criteria to identify a movie. All movies released since 1995 are categorized according to the following attributes: Creative type (factual, contemporary fiction, fantasy etc.), Source (book, play, original screenplay etc.), Genre (drama, horror, documentary etc.), MPAA rating, Production method (live action, digital animation etc.) and Distributor. In order to provide a fair comparison between movies released in different years, all rankings are based on ticket sales, which are calculated using average ticket prices announced by the MPAA in their annual state of the industry report.
The Dataset contains various files illustrating statistics such as annual ticket sales, highest grossers each year since 1995, top grossing creative types, top grossing distributors, top grossing genres, top grossing MPAA ratings, top grossing sources, top grossing production methods and the number of wide releases each year by various distributors.
The data was obtained from The Numbers website. Their theatrical market pages are based on the domestic theatrical market performance only. The domestic market is defined as the North American movie region (consisting of the United States, Canada, Puerto Rico and Guam). This data can be found from the website https://www.the-numbers.com/market/ with detailed analysis.
2020 and 2021 were rough years for the movie industry, and being a huge movie fanatic inspired me to share a dataset showing the exponential growth of box office collections as well as ticket sales over time (and the decline after 2020 due to the Covid-19 pandemic), indirectly indicating the quality of modern-day films. This dataset can also be used to study the genres that attract audiences the most, and to encourage one to create an amazing genre-specific plot and take one step closer to becoming the next most successful director!
--- Original source retains full ownership of the source dataset ---
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset offers a valuable corpus of film reviews in Spanish, specifically designed to support Natural Language Processing (NLP) research and development. In a field that often focuses heavily on the English language, this collection provides a much-needed resource for understanding natural language within the Spanish context. It comprises user-generated criticisms of over 50 highly relevant Spanish films, sourced from the Filmaffinity.com website. The aim is to foster knowledge sharing in Spanish NLP among users.
The dataset is structured in a tabular format, typically available as a CSV file. It contains reviews related to more than 50 Spanish films. Specific counts for rows or records are not provided; however, the file's delimiter is a double pipe "||".
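Because the delimiter is the two-character string "||", loading the file needs a parser that accepts multi-character separators; in pandas that means the python engine with a regex separator. The column names below are illustrative assumptions, not the file's documented schema:

```python
import io
import pandas as pd

# "||" is a multi-character separator, so pandas needs engine="python"
# (the default C engine only handles single-character delimiters) and the
# pipes must be escaped because the separator is treated as a regex.
sample = "film||review\nEl laberinto del fauno||Una obra maestra\n"
df = pd.read_csv(io.StringIO(sample), sep=r"\|\|", engine="python")
```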
This dataset is ideally suited for various applications in Natural Language Processing (NLP) focusing on the Spanish language. It can be used for: * Developing and testing NLP models for sentiment analysis on Spanish text. * Training machine learning models for text classification or topic modelling. * Learning and experimenting with NLP techniques using a real-world Spanish corpus. * Facilitating knowledge exchange and collaborative projects on Spanish NLP.
The dataset focuses exclusively on Spanish films and Spanish language reviews. The films included are those considered most relevant at the time the dataset was created, ensuring a relevant and current body of criticism from Filmaffinity.com users. There is no specified time range beyond the creation date for the included films.
CC0
This dataset is particularly beneficial for: * Spanish-speaking Kaggle users looking to contribute to and learn from NLP projects in their native language. * Researchers and students in artificial intelligence, linguistics, or data science focusing on NLP within the Spanish context. * Developers building applications that require understanding or processing Spanish text, especially in the entertainment or media sectors. * Anyone interested in analysing user-generated content and opinions on films in Spanish.
Original Data Source: Críticas películas filmaffinity en Español
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the raw data for the paper "Learning heterogeneous reaction kinetics from X-ray movies pixel-by-pixel". The MAT-file contains two variables: 1. 'stxm' is a structure array that contains all the STXM data. Each entry contains the STXM images scanned over one region, which may contain one or two particles. stxm contains the following fields:
- name: the name of the scanned region.
- scan: the scan number for all frames associated with this region.
- date: the date of the experiment.
- time: the time of the scan.
- lfpmat: the intensity of LiFePO4 (the variable 'a' in SI Eq. 112). The first two dimensions are image coordinates (x and y). The third dimension is the frame index, whose length is equal to the length of 'scan'.
- fpmat: the intensity of FePO4 (the variable 'b' in SI Eq. 112). The first two dimensions are image coordinates (x and y). The third dimension is the frame index, whose length is equal to the length of 'scan'.
- segment: a cell in which each entry is the frame indices associated with a charge or discharge half cycle.
- boundary: a cell in which the i-th entry is the image coordinates of the boundary of particle i in this region. The first and second columns are the x and y coordinates, respectively.
- roi: a cell in which the i-th entry is the region-of-interest (ROI) of particle i in this region. The ROI is a logical array in which 1 indicates a pixel inside the particle and 0 indicates a pixel outside the particle.
- Area: the area (in number of pixels) of the particles in this region.
- Centroid: the image coordinates of the centroids of the particles in this region. Each row corresponds to a particle. The first and second columns are the x and y coordinates, respectively.
- Orientation: the angle between the particles' major axis and the x-axis in degrees.
- MajorAxisLength: the length of the particles' major axis defined by the second moment of the ROI.
- MinorAxisLength: the length of the particles' minor axis defined by the second moment of the ROI.
- Crate: the global C-rate of the charge or discharge half cycle(s) measured for the entire cell. Its length is equal to the length of 'segment'.
- avg: a cell in which the i-th entry is the average Li fraction of particle i in all the frames.
- var: a cell in which the i-th entry is the variance of the Li fraction of particle i in all the frames.
- avgrate: the average local C-rate of the particles, defined as the change in average Li fraction over the duration of the half-cycle. Each row corresponds to a particle. Each column corresponds to a half-cycle. avgrate(i,j) is the average local C-rate of particle i during half-cycle j.
- inversion_Li_frac: the simulated Li fraction from the inversion result as shown in Fig. 2, SI Fig. 57, and SI Movie 1. inversion_Li_frac{i}{j} is the simulated Li fraction field of particle i during half-cycle j. The first two dimensions are image coordinates (x and y) (the sizes are the same as the first two dimensions of lfpmat and fpmat). The third dimension is the frame index, whose length is equal to the length of segment{j}. The value outside the ROI is NaN.
- inversion_k: the inverted heterogeneity k(x,y) as shown in Fig. 3b and Fig. 55. inversion_k{i} is the inverted k(x,y) of particle i. The two dimensions are image coordinates (x and y) (the sizes are the same as the first two dimensions of lfpmat and fpmat). The value outside the ROI is NaN.
2. 'aem' is a structure array that contains all the AEM data. Each entry contains the AEM image of a particle. aem contains the following fields:
- carbon: the AEM carbon signal I(x,y).
- name: the name of the scanned region that the particle is in.
- region: the index of the particle in the scanned region.
- augercp: coordinates of the control points in the AEM image. The first and second columns are the x and y coordinates, respectively.
- stxmcp: coordinates of the corresponding control points in the corresponding STXM image. The first and second columns are the x and y coordinates, respectively. 'augercp' and 'stxmcp' are used for image registration between AEM and STXM.
- auger2stxm: an affine2d object that determines the affine transformation for registration from AEM to STXM images. It is defined based on the control points.
- tree: the index in the 'stxm' structure array that this particle corresponds to.
- roi: the ROI of the particle.
MAD (Movie Audio Descriptions) is an automatically curated large-scale dataset for the task of natural language grounding in videos, or natural language moment retrieval. MAD exploits available audio descriptions of mainstream movies. Such audio descriptions are written for visually impaired audiences and are therefore highly descriptive of the visual content being displayed. MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video, and provides a unique setup for video grounding, as the visual stream is truly untrimmed: the average video duration is 110 minutes, two orders of magnitude longer than in legacy datasets.
Take a look at the paper for additional information.
From the authors on availability: "Due to copyright constraints, MAD’s videos will not be publicly released. However, we will provide all necessary features for our experiments’ reproducibility and promote future research in this direction"
https://creativecommons.org/publicdomain/zero/1.0/
Thank you for viewing my dataset; I look forward to seeing some code.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset published here was used to measure a high-resolution 3D waveform of isolated and reactivated axonemes from Chlamydomonas reinhardtii.
Note: This dataset contains a motion-blur correction applied to the data in doi: https://doi.org/10.1101/2024.03.18.585533 and code that details how the 3D average waveform was calculated.
It was further used to show twist-torsion coupling in these axonemes (doi:10.1038/s41567-025-02783-2).
The data is organized in seven folders:
1) High-resolution average 3D waveform of isolated and reactivated axonemes from Chlamydomonas reinhardtii.
Data files (MATLAB and txt format) contain the 3D coordinates (along the 3D arc-length) of 32 axonemal shapes that comprise one beat-cycle.
A corresponding txt file describes the details of the dataset.
2) 3D waveforms of single isolated and reactivated axonemes from Chlamydomonas reinhardtii.
Data files (MATLAB and txt format) contain the 3D shapes of 17 individual axonemes obtained from defocused darkfield-microscopy images.
A corresponding txt file describes the details of the dataset.
3) Image Raw Data of single isolated and reactivated axonemes used to reconstruct the 3D waveform
Movie files (multi-layer tif) of reactivated axonemes imaged with defocused-darkfield-microscopy.
A corresponding txt file describes the details of the dataset.
4) Calibration of defocused darkfield-microscopy.
Data file (MATLAB) contains the relationship between the z-position relative to the focal plane and the full-width-at-half-maximum (FWHM) of the axoneme signal, measured normal to the centerline, as well as the z-stack of images (multi-layer tif) used to extract this relation.
A corresponding txt file describes the details of the dataset.
5) Distance between gold nanoparticle (GNP) and the axonemal centerline as a function of the beat cycle
Data file (MATLAB) contains 20 measurements of d_C (where d_C is the normal distance between the center position of the GNP and the axoneme centerline in 2D images) as a function of time. A corresponding txt file describes the details of the dataset.
6) Image Raw Data of single isolated and reactivated axonemes with attached GNPs used to measure d_C.
Movie files (multi-layer tif) of reactivated axonemes with attached gold nanoparticles (GNPs) imaged with darkfield-microscopy.
A corresponding txt file describes the details of the dataset.
7) Code to calculate the average 3D waveform from defocused darkfield movies
MATLAB code used to calculate the average waveform of an axoneme that was recorded with high-speed defocused darkfield microscopy.
A corresponding pdf file (Manual.pdf) describes the details of the procedure in 6 steps.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Normalization
# Generate a resting state (rs) timeseries (ts)
# Install / load package to make fake fMRI ts
# install.packages("neuRosim")
library(neuRosim)
# Generate a ts
ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1)
# 3dDetrend -normalize
# R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1"
# Do for the full timeseries
ts.normalised.long <- (ts.rs-mean(ts.rs))/sqrt(sum((ts.rs-mean(ts.rs))^2));
# Do this again for a shorter version of the same timeseries
ts.shorter.length <- length(ts.normalised.long)/4
ts.normalised.short <- (ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))/sqrt(sum((ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))^2));
# By looking at the summaries, it can be seen that the median values become larger
summary(ts.normalised.long)
summary(ts.normalised.short)
# Plot results for the long and short ts
# Truncate the longer ts for plotting only
ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length]
# Give the plot a title
title <- "3dDetrend -normalize for long (blue) and short (red) timeseries";
plot(x=0, y=0, main=title, xlab="", ylab="", xaxs='i', xlim=c(1,length(ts.normalised.short)), ylim=c(min(ts.normalised.short),max(ts.normalised.short)));
# Add zero line
lines(x=c(-1,ts.shorter.length), y=rep(0,2), col='grey');
# 3dDetrend -normalize -polort 0 for long timeseries
lines(ts.normalised.long.made.shorter, col='blue');
# 3dDetrend -normalize -polort 0 for short timeseries
lines(ts.normalised.short, col='red');
Standardization/modernization
New afni_proc.py command line
afni_proc.py \
-subj_id "$sub_id_name_1" \
-blocks despike tshift align tlrc volreg mask blur scale regress \
-radial_correlate_blocks tcat volreg \
-copy_anat anatomical_warped/anatSS.1.nii.gz \
-anat_has_skull no \
-anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \
-anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
-anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
-anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \
-anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \
-anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \
-anat_follower_erode fsvent fswm \
-dsets media_?.nii.gz \
-tcat_remove_first_trs 8 \
-tshift_opts_ts -tpattern alt+z2 \
-align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
-tlrc_base "$basedset" \
-tlrc_NL_warp \
-tlrc_NL_warped_dsets \
anatomical_warped/anatQQ.1.nii.gz \
anatomical_warped/anatQQ.1.aff12.1D \
anatomical_warped/anatQQ.1_WARP.nii.gz \
-volreg_align_to MIN_OUTLIER \
-volreg_post_vr_allin yes \
-volreg_pvra_base_index MIN_OUTLIER \
-volreg_align_e2a \
-volreg_tlrc_warp \
-mask_opts_automask -clfrac 0.10 \
-mask_epi_anat yes \
-blur_to_fwhm -blur_size $blur \
-regress_motion_per_run \
-regress_ROI_PC fsvent 3 \
-regress_ROI_PC_per_run fsvent \
-regress_make_corr_vols aeseg fsvent \
-regress_anaticor_fast \
-regress_anaticor_label fswm \
-regress_censor_motion 0.3 \
-regress_censor_outliers 0.1 \
-regress_apply_mot_types demean deriv \
-regress_est_blur_epits \
-regress_est_blur_errts \
-regress_run_clustsim no \
-regress_polort 2 \
-regress_bandpass 0.01 1 \
-html_review_style pythonic
We used similar command lines to generate the 'blurred and not censored' and the 'not blurred and not censored' timeseries files (described more fully below). We will make the code used to generate all derivative files available on our github site (https://github.com/lab-lab/nndb).

We made one choice above that is different enough from our original pipeline that it is worth mentioning here. Specifically, we have quite long runs, averaging ~40 minutes but variable in length (thus leading to the above issue with 3dDetrend's -normalize). A discussion on the AFNI message board with one of our team (starting here: https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256) led to the suggestion that '-regress_polort 2' with '-regress_bandpass 0.01 1' be used for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach. Our new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.
Which timeseries file you use is up to you, but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul's own words:
* Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice).
* Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere).
* For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data).
* For censored data:
  * Performing ISC requires the users to unionize the censoring patterns during the correlation calculation.
  * If wanting to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might still do for naturalistic tasks), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC params) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data.
In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.
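The censoring caveat for ISC can be sketched as follows: correlate two subjects only over timepoints that survive censoring in both subjects. This is a toy numpy illustration under assumed names and synthetic data, not our pipeline code:

```python
import numpy as np

# Hypothetical ISC helper: correlate two subjects' timeseries only over
# timepoints kept (uncensored) in BOTH subjects, i.e. combine the two
# censoring patterns before computing the correlation.
def isc_censored(ts_a, ts_b, keep_a, keep_b):
    keep = keep_a & keep_b               # timepoint survives in both subjects
    return np.corrcoef(ts_a[keep], ts_b[keep])[0, 1]

# Synthetic demo: a shared signal plus subject-specific noise, with ~5% of
# timepoints censored per subject.
rng = np.random.default_rng(0)
sig = rng.standard_normal(200)
ts_a = sig + 0.1 * rng.standard_normal(200)
ts_b = sig + 0.1 * rng.standard_normal(200)
keep_a = rng.random(200) > 0.05
keep_b = rng.random(200) > 0.05
r = isc_censored(ts_a, ts_b, keep_a, keep_b)
```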
Effect on results
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
This table contains 420 series, with data for years 1996/1997 - 2004/2005 (not all combinations necessarily have data for all years), and is no longer being released. This table contains data described by the following dimensions (Not all combinations are available): Geography (12 items: Canada; Newfoundland and Labrador; Prince Edward Island; Nova Scotia; ...), Type of venue (3 items: Total movie theatres and drive-ins; Movie theatres; Drive-ins), Summary characteristics (14 items: Number of theatres; Paid admissions; Average ticket prices; Number of screens; ...).
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Representative time-lapse movie of a normal mouse mammary fragment in collagen I. CIL 42168 is a related movie of a normal mammary fragment in Matrigel. Images taken every 20 min. This movie is part of a group of movies that include CIL 42151-42168.
We introduce InfiniBench, a comprehensive benchmark for very long video understanding, which presents 1) the longest video duration, averaging 76.34 minutes; 2) the largest number of question-answer pairs, 108.2K; 3) diversity in questions that examine nine different skills and include both multiple-choice and open-ended questions; 4) a human-centric design, as the video sources come from movies and daily TV shows, with human-level question designs such as movie-spoiler questions that require critical thinking and comprehensive understanding. Using InfiniBench, we comprehensively evaluate existing Large Multi-Modality Models (LMMs) on each skill, including the commercial model Gemini 1.5 Flash and open-source models. The evaluation shows significant challenges in our benchmark. Our results show that even the best AI models, such as Gemini, struggle to perform well, with 42.72% average accuracy and an average score of 2.71 out of 5.
https://choosealicense.com/licenses/unknown/
Dataset Card for [Dataset Name]
Dataset Summary
The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/sst2.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The moviesAnalyzed.csv file is a comma-separated-value file with the data used in Ghirlanda S, Acerbi A, Herzog H, "Dog movie stars and dog breed popularity," currently under review at Proceedings of the Royal Society of London, B. The columns in the file have the meanings given below. When a piece of information was not found or cannot be computed, it is given as NA (see paper for possible reasons).
- dog: name of the dog actor
- breed: the portrayed dog's breed
- year: the year of movie release
- title: the movie title
- earnings1: movie earnings during the opening weekend (in 2012 USD)
- earnings: total movie earnings (in 2012 USD)
- disney: whether the movie was produced by the Walt Disney Company
- before[n]: the n-year popularity trend of the considered breed before movie release
- after[n]: the n-year popularity trend of the considered breed after movie release
- popularity[n]: average number of registrations for the considered breed in the 2n+1 years around movie release (between n years before and n years after)
- effect[n]: the n-year effect of the movie on the breed's popularity trend
- excess[n]: registrations of the considered breed attributable to movie release (actual registrations over the n years after movie release minus registrations predicted based on the trend observed n years before movie release)
- viewers: estimated number of people who saw the movie
- viewers1: estimated number of people who saw the movie over its opening weekend
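As a quick usage sketch, the file can be loaded with pandas, treating NA as missing. The two sample rows below are invented placeholders that only mirror the column layout (a subset of the columns is shown); they are not values from the actual dataset.

```python
import io
import pandas as pd

# Made-up sample mimicking the layout of moviesAnalyzed.csv.
sample = io.StringIO(
    "dog,breed,year,title,earnings1,earnings,disney\n"
    "Rex,Collie,1995,Example Movie,1000000,5000000,0\n"
    "Fido,Beagle,2001,Another Movie,NA,NA,1\n"
)
df = pd.read_csv(sample, na_values=["NA"])  # NA = not found / not computable

# For earnings analyses, drop rows where earnings could not be found.
with_earnings = df.dropna(subset=["earnings"])
```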
https://creativecommons.org/publicdomain/zero/1.0/
Title: 9,565 Top-Rated Movies Dataset
Description:
This dataset offers a comprehensive collection of 9,565 of the highest-rated movies according to audience ratings on the Movie Database (TMDb). The dataset includes detailed information about each movie, such as its title, overview, release date, popularity score, average vote, and vote count. It is designed to be a valuable resource for anyone interested in exploring trends in popular cinema, analyzing factors that contribute to a movie’s success, or building recommendation engines.
Key Features:
- Title: The official title of each movie.
- Overview: A brief synopsis or description of the movie's plot.
- Release Date: The release date of the movie, formatted as YYYY-MM-DD.
- Popularity: A score indicating the current popularity of the movie on TMDb, which can be used to gauge current interest.
- Vote Average: The average rating of the movie, based on user votes.
- Vote Count: The total number of votes the movie has received.
Data Source:
The data was sourced from the TMDb API, a well-regarded platform for movie information, using the /movie/top_rated endpoint. The dataset represents a snapshot of the highest-rated movies as of the time of data collection.
Data Collection Process:
- API Access: Data was retrieved programmatically using TMDb’s API.
- Pagination Handling: Multiple API requests were made to cover all pages of top-rated movies, ensuring the dataset’s comprehensiveness.
- Data Aggregation: Collected data was aggregated into a single, unified dataset using the pandas library.
- Cleaning: Basic data cleaning was performed to remove duplicates and handle missing or malformed data entries.
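The four steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the original collection script: the endpoint and field names come from TMDb's public API, the API key is a placeholder, and fetch_page performs a live network call (so only the aggregation/cleaning step is exercised offline).

```python
import json
import urllib.parse
import urllib.request

import pandas as pd

API_KEY = "YOUR_TMDB_API_KEY"  # placeholder -- register at themoviedb.org
BASE = "https://api.themoviedb.org/3/movie/top_rated"
COLUMNS = ["title", "overview", "release_date", "popularity",
           "vote_average", "vote_count"]

def fetch_page(page):
    """Steps 1-2: retrieve one page of /movie/top_rated (network call)."""
    url = BASE + "?" + urllib.parse.urlencode({"api_key": API_KEY,
                                               "page": page})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def aggregate_pages(pages):
    """Steps 3-4: flatten the page payloads into one pandas DataFrame,
    keep the documented fields, and drop duplicate titles."""
    rows = [movie for page in pages for movie in page["results"]]
    df = pd.DataFrame(rows)[COLUMNS]
    return df.drop_duplicates(subset="title").reset_index(drop=True)
```

A full run would loop `fetch_page(1)`, `fetch_page(2)`, … up to the `total_pages` value in each response, then pass the collected payloads to `aggregate_pages`.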
Potential Uses:
- Trend Analysis: Analyze trends in movie ratings over time or compare ratings across different genres.
- Recommendation Systems: Build and train models to recommend movies based on user preferences.
- Sentiment Analysis: Perform text analysis on movie overviews to understand common themes and sentiments.
- Statistical Analysis: Explore the relationship between popularity, vote count, and average ratings.
Data Format: The dataset is provided in a structured tabular format (e.g., CSV), making it easy to load into data analysis tools like Python, R, or Excel.
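For example, the CSV loads directly into pandas, and the rating/popularity relationship mentioned above can be inspected with a correlation matrix. The three rows below are invented placeholders, not actual dataset values.

```python
import io
import pandas as pd

# Hypothetical CSV fragment mirroring the dataset's tabular layout.
csv_text = """title,release_date,popularity,vote_average,vote_count
The Example,1994-09-23,45.1,8.7,25000
Another Film,1972-03-14,38.9,8.7,19000
Third Picture,2008-07-16,60.2,8.5,31000
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["release_date"])

# Pairwise correlations between popularity, average rating, and vote count.
corr = df[["popularity", "vote_average", "vote_count"]].corr()
```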
Usage License: The dataset is shared under [appropriate license], ensuring that it can be used for educational, research, or commercial purposes, with proper attribution to the data source (TMDb).