Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split, then print the first four examples.
ds = tfds.load('imdb_reviews', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://crawlfeeds.com/privacy_policy
Unlock one of the most comprehensive movie datasets available: 4.5 million structured IMDb movie records, extracted and enriched for data science, machine learning, and entertainment research.
This dataset includes a vast collection of global movie metadata, including details on title, release year, genre, country, language, runtime, cast, directors, IMDb ratings, reviews, and synopsis. Whether you're building a recommendation engine, benchmarking trends, or training AI models, this dataset is designed to give you deep and wide access to cinematic data across decades and continents.
Perfect for use in film analytics, OTT platforms, review sentiment analysis, knowledge graphs, and LLM fine-tuning, the dataset is cleaned, normalized, and exportable in multiple formats.
Genres: Drama, Comedy, Horror, Action, Sci-Fi, Documentary, and more
Train LLMs or chatbots on cinematic language and metadata
Build or enrich movie recommendation engines
Run cross-lingual or multi-region film analytics
Benchmark genre popularity across time periods
Power academic studies or entertainment dashboards
Feed into knowledge graphs, search engines, or NLP pipelines
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The original dataset, contributed by Arevalo et al. in their work Gated Multimodal Units for Information Fusion, can be downloaded from their git repository, where you can also find the web-scraping scripts they used to create it. From there you can download the hdf5 file and metadata.
The main problem is that this dataset contains data that is often unnecessary, for example the image latent features, the word n-grams, the imdb ids, and so on. Furthermore, the poster captions are already tokenized, so to see the real text you must apply the ix_to_word dictionary from the metadata, which adds an extra step if you are trying different word tokenizers. The hdf5 file is 15.6GB and the metadata npy file adds another 65MB, which makes for a rather large dataset to work with if you only need the minimal information.
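The decoding step described above can be sketched as follows. This is a minimal illustration, assuming ix_to_word behaves like a mapping from integer token ids to word strings. The toy vocabulary, the decode_caption helper, and the "<unk>" placeholder are hypothetical and are not taken from the original metadata file.

```python
def decode_caption(token_ids, ix_to_word, unknown="<unk>"):
    """Map a sequence of integer token ids back to a space-joined string."""
    return " ".join(ix_to_word.get(int(ix), unknown) for ix in token_ids)


# Toy vocabulary standing in for the real ix_to_word dictionary (assumption).
ix_to_word = {0: "a", 1: "detective", 2: "hunts", 3: "killer"}

caption = decode_caption([0, 1, 2, 0, 3], ix_to_word)
print(caption)
```

Ids missing from the mapping fall back to the placeholder instead of raising, which may or may not suit a given tokenizer experiment.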
Simplified MM-IMDb only has two files:
- data.npy (18.1MB). Stores image index, one-hot encoding of the genre, and the caption/description of the poster.
- images.npz (3.2GB). Stores all dataset images as numpy arrays.
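As a rough sketch of how the one-hot genre encoding might be turned back into readable labels: the genre list and its order below are hypothetical, as is the genres_from_one_hot helper, so check the actual label order shipped with the dataset before relying on it.

```python
def genres_from_one_hot(one_hot, genre_names):
    """Return the genre names whose position in the vector is set."""
    if len(one_hot) != len(genre_names):
        raise ValueError("vector length does not match the number of genres")
    return [name for flag, name in zip(one_hot, genre_names) if flag]


# Hypothetical label order; the real order comes from the dataset itself.
GENRES = ["Drama", "Comedy", "Horror"]

labels = genres_from_one_hot([1, 0, 1], GENRES)
print(labels)
```

The helper also tolerates vectors with more than one active entry, returning every matching genre name.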
With this dataset you can start training your multimodal models for multi-class classification, modality alignment, Masked-Language-Modelling, caption-based image retrieval, visual question answering, and many more.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Scooby-Doo is one of the most iconic cartoon characters of all time. The lovable Great Dane and his human friends have been solving mysteries and catching bad guys for over 50 years.
This dataset contains information on every Scooby-Doo episode and movie, including the title, air date, run time, and various other variables. It took me over a year to watch every Scooby-Doo iteration and track every variable. Many values are subjective by nature of watching but I tried my hardest to keep the data collection consistent.
If you plan to use this data for anything school/entertainment related, you are free to do so (credit is always welcome).
To use this dataset, simply download it and then import it into your preferred software program. Once you have imported the dataset, you can then begin to analyze the data.
There are a number of different ways that you can analyze this data. For example, you could look at the distribution of Scooby-Doo episodes by season, or by year. You could also look at the popularity of different Scooby-Doo characters by looking at how often they are mentioned in the dataset.
This dataset is a great resource for anyone interested in Scooby-Doo, or in analyzing television data more generally. Enjoy!
- Using the IMDB rating, run time, and engagement score, predict how much I will enjoy an episode/movie.
- Determine which network airs the best Scooby-Doo content based on average IMDB rating and engagement score.
- Analyze the impact of gender on catch rate for monsters/culprits.
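The network comparison idea above could start from something like the sketch below, which averages the imdb column per network using only the standard library. The sample rows are made up for illustration and do not come from scoobydoo.csv, and the assumption that missing ratings appear as non-numeric text (here "NULL") is a guess about the file.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; they are not real dataset values.
SAMPLE_CSV = """network,title,imdb
Network A,Example Episode 1,8.0
Network A,Example Episode 2,7.0
Network B,Example Episode 3,6.5
Network B,Example Episode 4,NULL
"""


def mean_imdb_by_network(rows):
    """Average the 'imdb' column per 'network', skipping unparsable ratings."""
    totals = defaultdict(lambda: [0.0, 0])  # network -> [sum, count]
    for row in rows:
        try:
            rating = float(row["imdb"])
        except (TypeError, ValueError):
            continue  # missing or non-numeric rating
        totals[row["network"]][0] += rating
        totals[row["network"]][1] += 1
    return {net: total / count for net, (total, count) in totals.items()}


averages = mean_imdb_by_network(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for network, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{network}: {avg:.2f}")
```

To run it against the real file, the same function can be fed csv.DictReader(open("scoobydoo.csv", newline="", encoding="utf-8")); the engagement column could be averaged the same way.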
License
License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
- You are free to:
  - Share - copy and redistribute the material in any medium or format for non-commercial purposes only.
  - Adapt - remix, transform, and build upon the material for non-commercial purposes only.
- You must:
  - Give appropriate credit - Provide a link to the license, and indicate if changes were made.
  - ShareAlike - You must distribute your contributions under the same license as the original.
- You may not:
  - Use the material for commercial purposes.
File: scoobydoo.csv

| Column name | Description |
|:----------------|:-------------------------------------------------------------------|
| level_0 | The level of the episode or movie. (Numeric) |
| series_name | The name of the series the episode or movie is from. (String) |
| network | The network the episode or movie aired on. (String) |
| season | The season of the series the episode or movie is from. (Numeric) |
| title | The title of the episode or movie. (String) |
| imdb | The IMDB rating of the episode or movie. (Numeric) |
| engagement | The engagement rating of the episode or movie. (Numeric) |
| date_aired | The date the episode or movie aired. (Date) |
| run_time | The run time of the episode or movie. (Time) |
| format | The format of the episode or movie. (String) |
| monster_name | The name of the monster in the episode or movie. (String) |
| monster_gender | The gender of the monster in the episode or movie. (String) |
| monster_type | The type of monster in the episode or movie. (String) |
| monster_subtype | The subtype of monster in the episode or movie. (String) |
| monster_species | The species of monster in the episode or movie. (String) |
| monster_real | Whether the monster is real or not. (Boolean) |
| monster_amount | The number of monsters in the episode or movie. (Numeric) |

...
https://networkrepository.com/policy.php
Co-authorship networks, collaboration networks, collaboration graphs, communication networks, email networks, IMDB, aminer data, DBLP data, network science co-authorship network, citeseer, HepPh, CondMat, download information networks, collaboration graph data