Columns:
- title: The movie title
- overview: A brief description or synopsis of the movie
- genre_ids: One or more genre identifiers (which may be multi-label)

movies_genres.csv:
Columns:
- id: Genre identifier
- name: The corresponding genre name

This setup allows you to map each movie's genre_ids from the movies_overview file to its actual genre names using the movies_genres mapping.
Proposed NLP Task: Multi-Label Genre Classification
Objective: Create an NLP model that, given a movie's overview, predicts the correct genre(s) for that movie. Since a movie may belong to multiple genres, this is a multi-label classification task.
Challenge Breakdown
a. Problem Statement
Participants are tasked with designing and training an NLP model that takes the movie overview text as input and outputs one or more genres. The challenge could encourage approaches spanning from classical text classification methods (e.g., TF-IDF with logistic regression) to modern transformer-based models (e.g., BERT, RoBERTa).
b. Data Preprocessing
- Text Cleaning & Tokenization: Clean the movie overviews (e.g., lowercasing, removing special characters) and tokenize the text.
- Label Preparation: Transform the genre_ids into a multi-label format. Use movies_genres.csv to convert these IDs into genre names.
- Data Splitting: Create training, validation, and test sets, ensuring the distribution of genres is well represented.
c. Baseline Models
Encourage participants to start with simple models (e.g., bag-of-words, TF-IDF combined with logistic regression or random forest) and progress towards deep learning approaches such as LSTM-based networks or transformer models.
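The label-preparation step can be sketched with pandas and scikit-learn. The inline toy rows and the assumption that genre_ids is stored as a string like "[28, 878]" are illustrative; adjust to the actual CSV headers.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-ins for movies_overview and movies_genres (column names assumed).
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "overview": ["a space adventure", "a quiet romance"],
    "genre_ids": ["[28, 878]", "[18]"],  # often stored as strings in CSV dumps
})
genres = pd.DataFrame({"id": [18, 28, 878],
                       "name": ["Drama", "Action", "Science Fiction"]})

# Map genre IDs to names via the movies_genres table.
id2name = dict(zip(genres["id"], genres["name"]))
movies["genre_names"] = movies["genre_ids"].apply(
    lambda s: [id2name[i] for i in ast.literal_eval(s)])

# Binarize into a multi-label indicator matrix for training.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(movies["genre_names"])
print(mlb.classes_)  # alphabetical label order
print(Y)
```

The indicator matrix `Y` is the target format expected by most scikit-learn multi-label estimators.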
d. Evaluation Metrics Since this is a multi-label task, consider evaluation metrics such as:
- F1 Score (Macro / Micro): Balances precision and recall.
- Hamming Loss: Measures how many labels are incorrectly predicted.
- Subset Accuracy: For stricter evaluation (all labels must match exactly).

4. Additional Considerations
Baseline Code & Notebooks: Provide a starter notebook with initial data loading, preprocessing, and a simple baseline model. This helps lower the entry barrier for participants who may be new to multi-label NLP tasks.
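The multi-label metrics listed above are all available in scikit-learn; here is a minimal sketch on toy indicator arrays (not competition data). Note that `accuracy_score` on multi-label indicators is exactly subset accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Toy ground truth and predictions for 4 movies over 3 genres
# (columns: Action, Comedy, Drama) -- illustrative values only.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1]])

print("micro F1:", f1_score(y_true, y_pred, average="micro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
print("Hamming loss:", hamming_loss(y_true, y_pred))
# Subset accuracy: a sample counts only if every label matches.
print("subset accuracy:", accuracy_score(y_true, y_pred))
```

Micro-averaging pools all label decisions before computing F1, while macro-averaging computes F1 per genre and averages, which weights rare genres equally with common ones.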
Evaluation Server & Leaderboard: Ensure that your Kaggle competition setup allows for automatic evaluation using the selected metrics and that a public leaderboard is available for continuous feedback.
Documentation & Discussion: Include detailed documentation describing the datasets, the task requirements, and the evaluation procedure. Additionally, host a discussion forum to foster collaboration among participants.
Original Data Source: IMDb Movie Genre Classification Dataset
🎬 Movie Ratings & Metadata Dataset This dataset provides a structured collection of movie ratings along with essential metadata, making it a valuable resource for data analysis and machine learning projects. It consists of two main tables:
📊 Ratings Data This table contains user-generated ratings for various movies.
- userId → Unique identifier for each user.
- movieId → Unique identifier for each movie (linked to the movie metadata table).
- rating → Rating given by the user (typically on a scale of 0-5).
- timestamp → Time when the rating was given (Unix format).

🎞️ Movie Metadata
This table provides details about the movies being rated.
- movieId → Unique identifier for each movie (linked to the ratings table).
- title → Movie title, including the release year.
- genres → List of genres associated with the movie (e.g., Action, Comedy, Drama).

🎯 Use Cases:
✔️ Recommender Systems – Build personalized movie recommendation models.
✔️ Trend Analysis – Explore how audience preferences change over time.
✔️ Sentiment & Popularity Analysis – Compare ratings across different genres.
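A minimal pandas sketch of joining the two tables on movieId (toy inline rows; in practice load the real tables with pd.read_csv):

```python
import pandas as pd

# Toy stand-ins for the two tables described above.
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 3.5, 5.0],
    "timestamp": [964982703, 964981247, 964982224],
})
movies = pd.DataFrame({
    "movieId": [10, 20],
    "title": ["GoldenEye (1995)", "Money Train (1995)"],
    "genres": ["Action|Adventure|Thriller", "Action|Comedy|Crime|Drama"],
})

# Join each rating to its movie metadata, then average ratings per title.
merged = ratings.merge(movies, on="movieId", how="left")
mean_ratings = merged.groupby("title")["rating"].mean()
print(mean_ratings)
```

This joined frame is the usual starting point for the recommender-system and trend-analysis use cases listed above.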
This dataset is clean, structured, and ready for exploration. 🚀
👉 Start analyzing and uncover interesting movie insights! 🎥
https://choosealicense.com/licenses/other/
Dataset Card for "imdb"
Dataset Summary
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
https://creativecommons.org/publicdomain/zero/1.0/
Thank you for viewing my dataset; looking forward to seeing some code.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The real dataset consists of movie evaluations from IMDB, which provides a platform where individuals can evaluate movies on a scale of 1 to 10. If a user rates a movie and clicks the share button, a Twitter message is generated. We then extract the rating from the Twitter message. We treat the ratings on the IMDB website as the event truths, which are based on the aggregated evaluations from all users, whereas our observations come from only a subset of users who share their ratings on Twitter. Using the Twitter API, we collect information about the follower and following relationships between individuals that generate movie evaluation Twitter messages. To better show the influence of social network information on event truth discovery, we delete small subnetworks that consist of fewer than 5 agents. The final dataset we use consists of 2266 evaluations from 209 individuals on 245 movies (events), along with the social network between these 209 individuals. We regard the social network as undirected, since both follower and following relationships indicate that the two users have similar taste.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We introduce LifeQA, a benchmark dataset for video question answering focusing on daily real-life situations. Current video question-answering datasets consist of movies and TV shows. However, it is well-known that these visual domains do not represent our day-to-day lives. Movies and TV shows, for example, benefit from professional camera movements, clean editing, crisp audio recordings, and scripted dialog between professional actors. While these domains provide a large amount of data for training models, their properties make them unsuitable for testing real-life question-answering systems. Our dataset, by contrast, consists of video clips that represent only real-life scenarios. We collect 275 such video clips and over 2.3k multiple-choice questions. In this paper, we analyze the challenging but realistic aspects of LifeQA and apply several state-of-the-art video question-answering models to provide benchmarks for future research.
For more information, refer to https://lit.eecs.umich.edu/lifeqa/.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides information on the top 6000 (actually 6118) most popular Italian movies available on TMDB as of May 2023. The data was collected by a Python script through the TMDB API; entries were then filtered to remove any film with missing "overview", "release_date", or "vote_average" data. Some details such as "genre", "duration", etc. were omitted, but they can be easily retrieved with the TMDB API using the movie_id parameter from the "id" column.
EmoKey Moments Muse EEG Dataset (EKM-ED): A Comprehensive Collection of Muse S EEG Data and Key Emotional Moments
Dataset Description:
The EmoKey Moments EEG Dataset (EKM-ED) is an intricately curated dataset amassed from 47 participants, detailing EEG responses as they engage with emotion-eliciting video clips. Covering a spectrum of emotions, this dataset holds immense value for those diving deep into human cognitive responses, psychological research, and emotion-based analyses.
Dataset Highlights:
Precise Timestamps: Capturing the exact millisecond of EEG data acquisition, ensuring unparalleled granularity.
Brainwave Metrics: Illuminating the variety of cognitive states through the prism of Delta, Theta, Alpha, Beta, and Gamma waves.
Motion Data: Encompassing the device's movement in three dimensions for enhanced contextuality.
Auxiliary Indicators: Key elements like the device's positioning, battery metrics, and user-specific actions are meticulously logged.
Consent and Ethics: The dataset respects and upholds privacy and ethical standards. Every participant provided informed consent. This endeavor has received the green light from the Ethics Committee at the University of Granada, documented under the reference: 2100/CEIH/2021.
A pivotal component of this dataset is its focus on "key moments" within the selected video clips, honing in on periods anticipated to evoke heightened emotional responses.
Curated Video Clips within Dataset:
Film               | Emotion   | Duration (seconds)
The Lover          | Baseline  | 43
American History X | Anger     | 106
Cry Freedom        | Sadness   | 166
Alive              | Happiness | 310
Scream             | Fear      | 395
The cornerstone of EKM-ED is its innovative emphasis on these key moments, bringing to light the correlation between distinct cinematic events and specific EEG responses.
Key Emotional Moments in Dataset:
Film               | Emotion   | Key moment timestamps (seconds)
American History X | Anger     | 36, 57, 68
Cry Freedom        | Sadness   | 112, 132, 154
Alive              | Happiness | 227, 270, 289
Scream             | Fear      | 23, 42, 79, 226, 279, 299, 334
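As an illustration of how these key-moment timestamps might be used, here is a hedged sketch that slices fixed windows around the Anger timestamps from a synthetic multi-channel recording. The 256 Hz rate matches the Muse recording rate noted in the folder description, but the window lengths, channel count, and array layout are assumptions.

```python
import numpy as np

fs = 256                                  # assumed Muse S recording rate (Hz)
# Synthetic stand-in: 4 EEG channels, 120 s of data.
eeg = np.random.default_rng(0).standard_normal((4, fs * 120))

def epoch_around(eeg, fs, t_sec, pre=1.0, post=3.0):
    """Slice a [t - pre, t + post) window around a key moment at t_sec."""
    start = int((t_sec - pre) * fs)
    stop = int((t_sec + post) * fs)
    return eeg[:, start:stop]

# Key moments for Anger (American History X), from the table above.
epochs = [epoch_around(eeg, fs, t) for t in (36, 57, 68)]
print([e.shape for e in epochs])  # [(4, 1024), (4, 1024), (4, 1024)]
```

Epoching around annotated events like this is the usual first step before averaging or feature extraction in EEG emotion analyses.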
Citation: Gilman, T. L., et al. (2017). A film set for the elicitation of emotion in research. Behavior Research Methods, 49(6). Link to the study
With its unparalleled depth and focus, the EmoKey Moments EEG Dataset aims to advance research in fields such as neuroscience, psychology, and affective computing, providing a comprehensive platform for understanding and analyzing human emotions through EEG data.
——————————————————————————————————— FOLDER STRUCTURE DESCRIPTION ———————————————————————————————————
questionnaires: all the response questionnaires (Spanish), raw and preprocessed, including SAM
|——preprocessed: Ficha_Evaluacion_Participante_SAM_Refactored.csv: the SAM responses for every film clip
key_moments: the key moment timestamps for every emotion’s clip
muse_wearable_data: XXXX
|
|—raw
|——1: ID = 1 of subject
|————muse: EEG data of Muse device
|—————————ANGER_XXX.csv: EEG data of the anger elicitation
|—————————FEAR_XXX.csv: EEG data of the fear elicitation
|—————————HAPPINESS_XXX.csv: EEG data of the happiness elicitation
|—————————SADNESS_XXX.csv: EEG data of the sadness elicitation
|————order: film elicitation order of play. For example: HAPPINESS,SADNESS,ANGER,FEAR …
|
|—preprocessed
|——unclean-signals: without removing EEG artifacts, noise, etc.
|————muse: EEG data of Muse device
|—————————0.0078125: data downsampled to 128 Hz from the 256 Hz recorded
|——clean-signals: EEG artifacts, noise, etc. removed
|————muse: EEG data of Muse device
|—————————0.0078125: data downsampled to 128 Hz from the 256 Hz recorded
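The 256 Hz → 128 Hz downsampling mentioned in the folder names (0.0078125 s is the 1/128 Hz sample period) can be sketched with SciPy. This is an illustrative stand-in on a synthetic signal, not the preprocessing actually used for the dataset:

```python
import numpy as np
from scipy.signal import decimate

fs_in, fs_out = 256, 128           # recorded and target rates (from the notes above)
t = np.arange(0, 2, 1 / fs_in)     # 2 s of synthetic signal
x = np.sin(2 * np.pi * 10 * t)     # 10 Hz component, well below the new Nyquist (64 Hz)

# Anti-alias filter + downsample by the integer factor 256 / 128 = 2.
y = decimate(x, fs_in // fs_out, ftype="fir", zero_phase=True)
print(len(x), len(y))  # 512 256
```

`decimate` applies a low-pass filter before subsampling, which avoids the aliasing that naive slicing (`x[::2]`) can introduce.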
The ethical consent for this dataset was provided by La Comisión de Ética en Investigación de la Universidad de Granada, as documented in the approval titled: 'DETECCIÓN AUTOMÁTICA DE LAS EMOCIONES BÁSICAS Y SU INFLUENCIA EN LA TOMA DE DECISIONES MEDIANTE WEARABLES Y MACHINE LEARNING' registered under 2100/CEIH/2021.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Contributors: Liberty S. Hamilton, PhD, Maansi Desai, PhD, Alyssa Field, MEd
Email: liberty.hamilton@austin.utexas.edu
This is a sample BIDS dataset for the WIRED ICM course in Paris, France in March 2024.
This contains intracranial recordings collected by the Hamilton Lab at the University of Texas at Austin. These recordings include examples of evoked data during natural listening tasks along with some examples of seizure-related activity and vagus nerve stimulator (VNS) artifact for illustrative purposes. All procedures were approved by the University of Texas at Austin Institutional Review Board.
Funding: Support was provided by the National Institutes of Health National Institute on Deafness and Other Communication Disorders (R01 DC018579, to LSH).
movietrailers
- this task involves patients listening to movie clips from various Pixar, Disney, Dreamworks, and other movies. We have published previously using these stimuli in EEG (Desai et al. 2021).

timit4 and timit5
- these tasks involve patients listening to subsets of the TIMIT acoustic phonetic corpus (Garofolo et al. 1993). The events provided in the dataset mark the onset and offset of each sentence. In timit4, each sentence is unique, while in timit5, 10 sentences are repeated 10 times. This is the same stimulus set used in Mesgarani et al. 2014, Hamilton et al. 2018, Hamilton et al. 2021, and Desai et al. 2021.

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
*****ForrestGump-MEG: An audio-visual movie-watching MEG dataset*****
For details please refer to our paper on [].
This dataset contains MEG data recorded from 11 subjects while they watched the 2-hour Chinese-dubbed audio-visual movie 'Forrest Gump'. The data were acquired with a 275-channel CTF MEG system. Auxiliary data (T1w) as well as derivatives such as preprocessed data and the MEG-MRI co-registration are also included.
Pre-process procedure description
The T1w images stored as NIFTI files were minimally-preprocessed using the anatomical preprocessing pipeline from fMRIPrep with default settings.
MEG data were pre-processed using MNE following a three-step procedure: (1) bad channels were detected and removed; (2) a 1 Hz high-pass filter was applied to remove slow drifts from the continuous MEG data; (3) artifact removal was performed with ICA.
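The high-pass step can be illustrated with a self-contained SciPy sketch (the actual pipeline uses MNE's filtering; the sampling rate, filter order, and signal below are assumptions made purely for illustration):

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 600.0                         # assumed sampling rate for this sketch
t = np.arange(0, 10, 1 / fs)
drift = 0.5 * t                    # synthetic slow drift
signal = np.sin(2 * np.pi * 12 * t)
x = signal + drift

# Step 2 of the pipeline: 1 Hz high-pass to remove slow drifts.
b, a = butter(4, 1.0, btype="highpass", fs=fs)
y = filtfilt(b, a, x)              # zero-phase filtering
print(abs(y.mean()))               # near zero: the drift is removed
```

Zero-phase filtering (`filtfilt`) avoids shifting event timing, which matters when MEG responses are later aligned to movie events.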
Stimulus material
The audio-visual stimulus material was from the Chinese-dubbed 'Forrest Gump' DVD released in 2013 (ISBN: 978-7-7991-3934-0), which cannot be publicly released here due to copyright restrictions.
Dataset content overview
The data were organized following MEG-BIDS using the MNE-BIDS toolbox.
the pre-processed MEG data
The preprocessed MEG recordings including the preprocessed MEG data, the event files, the ICA decomposition and label files and the MEG-MRI coordinate transformation file are hosted here.
└── ./derivatives/preproc_meg-mne_mri-fmriprep/sub-xx/ses-movie/meg/
    ├── sub-xx_ses-movie_coordsystem.json
    ├── sub-xx_ses-movie_task-movie_run-xx_channels.tsv
    ├── sub-xx_ses-movie_task-movie_run-xx_decomposition.tsv
    ├── sub-xx_ses-movie_task-movie_run-xx_events.tsv
    ├── sub-xx_ses-movie_task-movie_run-xx_ica.fif.gz
    ├── sub-xx_ses-movie_task-movie_run-xx_meg.fif
    ├── sub-xx_ses-movie_task-movie_run-xx_meg.json
    ├── ...
    └── sub-xx_ses-movie_task-movie_trans.fif
the pre-processed MRI data
The preprocessed MRI volume, reconstructed surfaces, and other associated files, including transformation files, are hosted here.
└── ./derivatives/preproc_meg-mne_mri-fmriprep/sub-xx/ses-movie/anat/
    ├── sub-xx_ses-movie_desc-preproc_T1w.nii.gz
    ├── sub-xx_ses-movie_hemi-L_inflated.surf.gii
    ├── sub-xx_ses-movie_hemi-L_midthickness.surf.gii
    ├── sub-xx_ses-movie_hemi-L_pial.surf.gii
    ├── sub-xx_ses-movie_hemi-L_smoothwm.surf.gii
    ├── sub-xx_ses-movie_hemi-R_inflated.surf.gii
    ├── sub-xx_ses-movie_hemi-R_midthickness.surf.gii
    ├── sub-xx_ses-movie_hemi-R_pial.surf.gii
    ├── sub-xx_ses-movie_hemi-R_smoothwm.surf.gii
    ├── sub-xx_ses-movie_space-MNI152NLin2009cAsym_desc-preproc_T1w.nii.gz
    ├── sub-xx_ses-movie_space-MNI152NLin6Asym_desc-preproc_T1w.nii.gz
    └── ...
the FreeSurfer surface data, the high-resolution head surface and the MRI-fiducials are provided here
└── ./derivatives/preproc_meg-mne_mri-fmriprep/sourcedata/
    └── freesurfer
        └── sub-xx
            └── ...
the raw data
└── ./sub-xx/ses-movie/
    ├── meg/
    │   ├── sub-xx_ses-movie_coordsystem.json
    │   ├── sub-xx_ses-movie_task-movie_run-xx_channels.tsv
    │   ├── sub-xx_ses-movie_task-movie_run-xx_events.tsv
    │   ├── sub-xx_ses-movie_task-movie_run-xx_meg.ds
    │   ├── sub-xx_ses-movie_task-movie_run-xx_meg.json
    │   └── ...
    └── anat/
        ├── sub-xx_ses-movie_T1w.json
        └── sub-xx_ses-movie_T1w.nii.gz
https://choosealicense.com/licenses/unknown/
Dataset Card for [Dataset Name]
Dataset Summary
The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/sst2.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Four stimuli were used:
(Note that the versions presented to subjects were edited to remove credits and title pages; these edited versions are available upon request.)
Stimuli Description
- Iteration (https://youtu.be/c53fGdK84rc; 12:27 min:s) is a sci-fi movie that follows a female character as she goes through multiple iterations of waking up and trying to escape a facility. A male character appears toward the end to help her.
- Defeat (https://youtu.be/6yN9VH_4GSQ; 7:57 min:s) follows a family of three (mother, two children) as the brother bullies his sister and she builds a time machine to go back and get revenge.
- Growth (https://youtu.be/JyvFXBA3O8o; 8:27 min:s) follows a family of four (mother, father, two brothers) as the children grow up and eventually move out amid some family conflict.
- Lemonade (https://youtu.be/Av07QiqmsoA; 7:27 min:s) is a Rube Goldberg machine consisting of a series of objects that move throughout a house and ends in the pouring of a cup of lemonade. This movie was lightly edited to remove fleeting shots of human characters.

Iteration and Defeat both contained screen cuts (continuity editing), whereas Growth and Lemonade were shot in a continuous fashion with the camera panning smoothly from one scene to the next.
Runs are a bit longer than the movie stimuli themselves. We dropped the first 2 TRs and the last 12 TRs for each functional run. Please reach out if you have any questions.
For Growth, this corresponds to TRs: 2:505
For Lemonade, this corresponds to TRs: 2:449
For Defeat, this corresponds to TRs: 2:480
For Iteration, this corresponds to TRs: 2:748
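Dropping the first 2 and last 12 TRs amounts to a single slice along the time axis of each functional run; the run length and array layout below are hypothetical:

```python
import numpy as np

# Toy 4D array standing in for a functional run: (x, y, z, time).
n_trs = 519                        # hypothetical raw run length
run = np.zeros((4, 4, 4, n_trs))

# Drop the first 2 and the last 12 TRs, as described above.
trimmed = run[..., 2:-12]
print(trimmed.shape[-1])  # 505
```

The negative stop index makes the trim independent of the exact run length, which differs across the four movies.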
MPI (Max Planck Institute) Sintel is a dataset for optical flow evaluation that has 1064 synthesized stereo images and ground-truth disparity data. It is derived from the open-source 3D animated short film Sintel. The dataset has 23 different scenes. The stereo images are RGB while the disparity is grayscale. Both have a resolution of 1024×436 pixels and 8 bits per channel.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Template-stripped substrates provide on-demand access to clean, ultraflat gold surfaces, avoiding the need for laborious cleaning procedures or the use of expensive single-crystal electrodes. While these gold/adhesion layer/support sandwich structures are most conveniently prepared through the application of epoxy or optical adhesives, such composites exhibit instabilities in organic solvents that limit their wider application. Here we demonstrate that substrates with solvent-impermeable metal films can be used in previously problematic chemical environments after integration into a protective, custom-built (electrochemical) flow cell. We apply our methodology to probe different self-assembled monolayers, observing reproducible alkanethiol reductive desorption features, an exemplary redox response using 6-(ferrocenyl)hexanethiol, and corroborate findings that cobalt(II) bis(terpyridine) assemblies exhibit a low coverage. This work significantly extends the utility of these substrates, relative to mechanically polished or freshly deposited alternatives, particularly for studies of systems involving adsorbed molecules whose properties are strongly influenced by the nanoscopic features of the metal-solution interface.
OpenSubtitles is a collection of multilingual parallel corpora. The dataset is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages.
I love movies.
I tend to avoid Marvel-Transformers standardized products, and prefer a mix of classic Hollywood golden age and obscure Polish artsy movies. Throw in an occasional Japanese-zombie-slasher-giallo as an alibi. Good movies don't exist without bad movies.
On average I watch 200+ movies each year, with peaks at more than 500 movies. Nine years ago I started to log my movies to avoid watching the same movie twice, and also to assign scores. Over the years, it gave me a couple of insights on my viewing habits, but nothing more than what a tenth-grader would learn at school.
I've recently subscribed to Netflix, and it pains me to see the global inefficiency of recommendation systems for people like me, who mostly swear by "La politique des auteurs". It's a term coined by the famous New Wave French movie critic André Bazin, meaning that the quality of a movie is essentially linked to the director and his capacity to execute his vision with his crew. We could debate whether that depends on the movie production pipeline, but let's not for now. Practically, what it means is that I essentially watch movies from directors who made films I've liked.
I suspect Netflix calibrates their recommendation models taking into account the way the average Joe chooses a movie. A few months ago I read a study based on a survey, showing that people choose a movie mostly based on genre (55%), then by leading actors (45%). Director or release date were far behind, around 10% each. This is not surprising, since most people I know don't care who the director is. Lots of US blockbusters don't even mention it on the movie poster. I am aware that collaborative filtering is based on user proximity, which I believe decreases (or even eliminates) the need to characterize a movie. So here I'm more interested in content-based filtering, which is based on product proximity, for several reasons:
- Users' tastes are not easily accessible. It is, after all, Netflix's treasure chest.
- The movie offer on Netflix is so bad for someone who likes auteur movies that it wouldn't help.
- Modeling a movie's intrinsic qualities is a nice challenge.
Enough.
"*The secret of getting ahead is getting started*" (Mark Twain)
[network graph] https://img11.hostingpics.net/pics/117765networkgraph.png
The primary source is www.themoviedb.org. If you watch obscure artsy Romanian homemade movies, you may find only 95% of your movies referenced... but for anyone else it should be in the 98%+ range.
Movie details are from the www.themoviedb.org API: movies/details
Movie crew & casting are from the www.themoviedb.org API: movies/credits
Both can be joined by id.
They contain all 350k movies, from the end of the 19th century to August 2017. If you remove short movies from imdb, you get a similar number of movies.
I uploaded the program to retrieve incremental movie details on GitHub: https://github.com/stephanerappeneau/scienceofmovies/tree/master/PycharmProjects/GetAllMovies (needs a dev API key from themoviedb.org though).
I have tried various supervised (decision tree) and unsupervised (clustering, NLP) approaches described in the discussions; source code is on GitHub: https://github.com/stephanerappeneau/scienceofmovies
As a bonus, I've uploaded the bio summaries of the top 500 critically acclaimed directors from Wikipedia, for some interesting NLTK analysis.
Here is an overview of the available sources I've tried:
• Imdb.com free csv dumps (ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/temporaryaccess/) are badly documented, incomplete, loosely structured, and impossible to join/merge. There's an API hosted by Amazon Web Services: 1€ every 100,000 requests. With around 1 million movies it could become expensive, and the features are bare. So I searched for other sources.
• www.themoviedb.org is based on crowdsourcing and has an excellent API, limited to 40 requests every 10 seconds. It is quite generous, well documented, and enough to sweep the 450,000 movies in a few days. For my purpose, data quality is not significantly worse than imdb's, and as the imdb key is also included, there's always the possibility to complete my dataset later (I actually did it).
• www.Boxofficemojo.com has some interesting budget/revenue figures (which are sorely lacking in both imdb & tmdb), but it actually tracks only a few thousand movies, mainly blockbusters. There are other professional sources used by the film industry to get better predictive/marketing insights, but that's beyond my reach for this experiment.
• www.wikipedia.com is an interesting source with no real cap on API calls; however, it requires a bit of webscraping, and for movies or directors the layout and quality vary a lot, so I suspected it'd take a lot of work to get insights and put this source at lower priority.
• www.google.com will ban you after a few minutes of web scraping, because their job is to scrape data from others, then sell it, duh.
• It's worth mentioning that there are a few dumps of anonymized Netflix user tastes on Kaggle, because they've organised a few competitions to improve their recommendation models. https://www.kaggle.com/netflix-inc/netflix-prize-data
• Online databases are largely white Anglo-Saxon-centric, meaning the Bollywood offer (India is the 2nd biggest producer of movies) is mostly absent from datasets. I'm fine with that, as it's not my cup of tea, plus I lack domain knowledge. The sheer number of Indian movies would probably skew my results anyway (I don't want too many martial-arts musicals in my recommendations ;-)). I have, however, tremendous respect for the Indian movie industry, so I'd love to collaborate with an Indian cinephile!
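The TMDB limit of 40 requests every 10 seconds mentioned above can be respected with a small sliding-window throttle. This is an illustrative sketch (not an official client; the class name and interface are my own):

```python
from collections import deque

class Throttle:
    """Allow at most `max_requests` in any sliding `window`-second period."""

    def __init__(self, max_requests=40, window=10.0):
        self.max_requests, self.window = max_requests, window
        self.sent = deque()  # timestamps of requests inside the window

    def allow(self, now):
        """Return True if a request at time `now` stays within the limit."""
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()            # forget requests older than the window
        if len(self.sent) < self.max_requests:
            self.sent.append(now)
            return True
        return False

throttle = Throttle()
# 41 requests at the same instant: the 41st must wait.
results = [throttle.allow(0.0) for _ in range(41)]
print(results.count(True))  # 40
```

In a real crawler you would call `allow(time.monotonic())` before each HTTP request and sleep briefly whenever it returns False.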
[Westerns] https://img11.hostingpics.net/pics/340226westerns.png
Starting from there, I had multiple problem statements for both supervised and unsupervised machine learning:
- Can I program a tailored recommendation system based on my own criteria?
- What are the characteristics of the movies/directors I like most?
- What is the probability that I will like my next movie?
- Can I find the data?
One of the objectives of sharing my work here is to find cinephile data scientists who might be interested and, hopefully, contribute or share insights :) Other interesting leads: use the tagline for NLP/clustering/genre guessing, leverage budget/revenue, link with other data sources using the imdb normalized title, etc.
[Correlation matrix] https://img11.hostingpics.net/pics/977004matrice.png
I graduated from a French engineering school, majoring in artificial intelligence, but that was 17 years ago, right in the middle of the A.I. winter. Like a lot of white male rocket scientists, I ended up in one of the leading European investment banks, quickly abandoning IT development to specialize in trading/risk project management and internal politics. My recent appointment to the Data Office made me aware of recent breakthroughs in data science, and I thought that developing a side project would be an excellent occasion to learn something new. Plus, it'd give me some well-needed credibility, which decision makers too often lack when it comes to data science.
I've worked on some of the features with Cédric Paternotte, a friend of mine who is a professor of philosophy of science at La Sorbonne. Working with someone with a different background seemed a good idea for motivation, creativity and rigor.
Kudos to the www.themoviedb.org and www.wikipedia.com sites, which really have a great attitude towards open data. This is typically NOT the case for modern big-data companies, which mostly keep data to themselves to try to monetize it. Such a huge contrast with the imdb or instagram APIs, which generously let you grab your last 3 comments at a miserable rate. Even if 15 years ago this seemed a mandatory path to get services for free, I predict that one day governments will need to break this data monopoly.
[Disclaimer: I apologize in advance for my Engrish (I'm French ^-^), for any bad code I've written (there are probably hundreds of ways to do it better and faster), for any pseudo-scientific assumptions I've made (I'm slowly getting back into statistics and lack senior guidance; one day I regress a non-stationary time series and the day after I discover I shouldn't have), and for any incorrect use of machine-learning models.]
[powered by themoviedb.org] https://img11.hostingpics.net/pics/898068408x161poweredbyrectanglegreen.png
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A dataset and visualization script for comparing trends in the runtimes of Bollywood and Hollywood (American) feature-length films, 1970-2018, based on data publicly available on DBpedia/Wikipedia. The data were automatically extracted from DBpedia/Wikipedia and then manually organized for easier analysis.
Bollywood: https://en.wikipedia.org/wiki/Category:Lists_of_Bollywood_films_by_year
Hollywood: https://en.wikipedia.org/wiki/Category:Lists_of_American_films_by_year
Work/runtime data were obtained from http://dbpedia.org/ when available. Otherwise, if the "Running time" field was present in the Infobox of the http://en.wikipedia.org/wiki page, the data in that field was used. The data were further manually screened to remove any entries for non-feature-length films such as short films and TV series; however, this screening was not exhaustive.
Industry data revealed that Slovakia had the most extensive Netflix media library worldwide as of July 2024, with over 8,500 titles available on the platform. Interestingly, the top 10 ranking was spearheaded by European countries. Where do you get the most bang for your Netflix buck? In February 2024, Liechtenstein and Switzerland were the countries with the most expensive Netflix subscription rates: viewers had to pay around 21.19 U.S. dollars per month for a standard subscription. Subscribers in these countries could choose from between around 6,500 and 6,900 titles. On the other end of the spectrum, Pakistan, Egypt, and Nigeria are some of the countries with the cheapest Netflix subscription costs, at around 2.90 to 4.65 U.S. dollars per month. Popular content on Netflix: while viewing preferences can differ across countries and regions, some titles have proven particularly popular with international audiences. As of mid-2024, "Red Notice" and "Don't Look Up" were the most popular English-language movies on Netflix, each with over 230 million views in their first 91 days on the platform. Meanwhile, "Troll" ranks first among the top non-English-language Netflix movies of all time. The monster film has amassed 103 million views on Netflix, making it the most successful Norwegian-language film on the platform to date.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contact Information
If you would like further information about PeakAffectDS, to purchase a commercial license, or if you experience any issues downloading files, please contact us at peakaffectds@gmail.com.
Description
PeakAffectDS contains 663 files (total size: 1.84 GB), consisting of 612 physiology files and 51 perceptual rating files. The dataset covers 51 untrained research participants (39 female, 12 male), who had their body physiology recorded while watching movie clips validated to induce strong emotional reactions. Emotional conditions included calm, happy, sad, angry, fearful, and disgust, along with a neutral baseline condition. Four physiology channels were recorded with a Biopac MP36 system: two facial muscles with fEMG (zygomaticus major, corrugator supercilii) using Ag/AgCl electrodes, heart activity with ECG using a single-lead (Lead II) configuration, and respiration with a wearable strain-gauge belt. While viewing movie clips, participants indicated in real time when they experienced a "peak" emotional event, including chills, tears, or the startle reflex. After each clip, participants further rated their felt emotional state using a forced-choice categorical response measure, along with their felt Arousal and Valence. All data are provided in plaintext (.csv) format.
PeakAffectDS was created in the Affective Data Science Lab.
Physiology files
Each participant has 12 .CSV physiology files, consisting of 6 Emotional conditions and 6 Neutral baseline conditions. All physiology channels were recorded at 2000 Hz. A 50 Hz notch filter was then applied to fEMG and ECG channels to remove mains hum. Each .CSV file contains 6 columns, in order from left to right:
Perceptual files
There are 51 perceptual ratings files, one for each participant. Each .CSV file contains 4 columns, in order from left to right:
File naming convention
Each of the 612 physiology files has a unique filename. The filename consists of a 3-part numerical identifier (e.g., 09-02-03.csv). The first identifier refers to the participant's ID (09), while the remaining two identifiers refer to the stimulus presented for that recording (02-03.mp4); these identifiers define the stimulus characteristics:
Filename example: 09-02-03.csv
Filename example: 09-01-05.csv
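Based on the naming convention above, a physiology filename can be split programmatically. The sketch below is illustrative only: the helper name and the returned field names are not part of the dataset documentation.

```python
import os

def parse_physiology_filename(filename):
    """Split a filename such as '09-02-03.csv' into the participant ID
    and the two stimulus identifiers (field names are illustrative)."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    participant, stim_a, stim_b = stem.split("-")
    return {
        "participant_id": participant,        # e.g. '09'
        "stimulus_id": f"{stim_a}-{stim_b}",  # matches stimulus file '02-03.mp4'
    }

print(parse_physiology_filename("09-02-03.csv"))
```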
Methods
A 1-way mixed design was used, with a within-subjects factor Emotion (6 levels: Calm, Happy, Sad, Angry, Fearful, Disgust) and a between-subjects factor Stimulus Set (3 levels). Trials were blocked by Affect Condition (Baseline, Emotional), with each participant presented 6 blocked trials: Baseline (neutral), then Emotional (Calm, ..., Disgust). This design reduced potential contamination from preceding emotional trials by ensuring that participants' physiology began close to a resting baseline for emotional conditions.
Emotion was presented in pseudorandom order using a carryover-balanced generalised Youden design, generated by the crossdes package in R. Eighteen emotional movie clips were used as stimuli, with three instances for each emotion category (6x3). Clips were then grouped into one of three Stimulus Sets, with participants assigned to a given Set using block randomisation. For example, participants assigned to Stimulus Set 1 (PID: 1, 4, 7, ...) all saw the same movie clips, but these clips differed from those in Sets 2 and 3. Six Neutral baseline movie clips were used as stimuli, with all participants viewing the same neutral clips, with their order also generated with a Youden design.
Stimulus duration varied, with clips lasting several minutes. Lengthy clips without repetition were used to help ensure that participants became engaged, and experienced genuine, strong emotional responses. Participants were instructed to immediately indicate using the keyboard when experiencing a "peak" emotional event, including chills, tears, or startle. Participants were permitted to indicate multiple events in a single trial, and identified the type of the events at the trial feedback stage, along with ratings of emotional category, arousal, and valence. The concept of peak physiological events was explained at the beginning of the experiment, but the three states were not described as being associated with any particular emotion or valence.
License information
PeakAffectDS is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, CC BY-NC-SA 4.0.
Citing PeakAffectDS
Greene, N., Livingstone, S. R., & Szymanski, L. (2022). PeakAffectDS [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6403363
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We developed a new method to obtain chemical and refractive index sensing between the 1 and 2.5 micron near-infrared wavelengths on nanoporous gold (NPG) disks. We fabricated NPG disks in the laboratory by sputtering a gold-silver alloy film onto a glass substrate at approximately 80 nm thickness. Polystyrene beads were deposited onto the alloy film in a single layer and were reduced in size using an oxygen plasma treatment. The bead pattern was transferred onto the alloy using a sputter-etch method in argon plasma. After etching, the alloy was sonicated in chloroform to remove residual beads. The disks were then dealloyed using nitric acid. We measured infrared absorption using dispersive scanning UV-Vis-NIR and FT-IR interferometric spectrometers. This dataset reports the extinction spectra of water on NPG disks with diameters of either 350 or 600 nm. For NPG disks of 350 nm diameter, the extinction spectra are also provided for six other solvents with different refractive indices: salt water, ethanol, hexane, iso-octane, hexadecane, and toluene. We examined the surface-enhanced near-infrared absorption of 350 nm and 600 nm diameter NPG disks with a self-assembled monolayer (SAM) of octadecanethiol (ODT) and report the extinction spectra in this dataset. The surface-enhanced near-infrared absorption (SENIRA) spectra of hexadecane, dodecane, siloxane, pyrene, and Louisiana sweet grade crude oil on either 350 nm or 600 nm NPG disks are also reported in this dataset. Lastly, we deposited a film of poly(methyl methacrylate) (PMMA) of varying thickness (50-150 nm) onto the NPG disk substrate. The surface-enhanced near-infrared absorption spectra of 350 nm and 600 nm NPG disks on these films are reported in this dataset, as well as the wavelength shift at 1398 nm. This dataset is associated with the following paper: Shih, W.-C., Santos, G. M., Zhao, F., Zenasni, O., & Arnob, M. M. P. (2016). Simultaneous Chemical and Refractive Index Sensing in the 1-2.5 μm Near-Infrared Wavelength Range on Nanoporous Gold Disks. Nano Lett., 16(7), 4641-4647. doi:10.1021/acs.nanolett.6b01959.
Columns:

title: The movie title
overview: A brief description or synopsis of the movie
genre_ids: One or more genre identifiers (which could be multi-label)

movies_genres.csv:
Columns:

id: Genre identifier
name: The corresponding genre name

This setup allows you to map each movie’s genre_ids from the movies_overview file to its actual genre names using the movies_genres mapping.
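A minimal sketch of this ID-to-name mapping with pandas. The inline frames stand in for the two CSV files, and the stringified-list format of genre_ids (e.g., "[28, 12]") is an assumption; the real files may store the IDs differently.

```python
import ast
import pandas as pd

# Stand-ins for movies_overview.csv and movies_genres.csv.
movies = pd.DataFrame({
    "title": ["Example Movie"],
    "overview": ["A hero saves the day."],
    "genre_ids": ["[28, 12]"],  # assumed stringified list of IDs
})
genres = pd.DataFrame({"id": [28, 12], "name": ["Action", "Adventure"]})

# Build the id -> name lookup, then resolve each movie's genre IDs.
id_to_name = dict(zip(genres["id"], genres["name"]))
movies["genres"] = movies["genre_ids"].apply(
    lambda s: [id_to_name[i] for i in ast.literal_eval(s)]
)
print(movies[["title", "genres"]])
```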
Proposed NLP Task: Multi-Label Genre Classification

Objective: Create an NLP model that, given a movie’s overview, predicts the correct genre(s) for that movie. Since a movie may belong to multiple genres, this is a multi-label classification task.
Challenge Breakdown

a. Problem Statement
Participants are tasked with designing and training an NLP model that takes as input the movie overview text and outputs one or more genres. The challenge could encourage approaches that span from classical text classification methods (e.g., TF-IDF with logistic regression) to modern transformer-based models (e.g., BERT, RoBERTa).
b. Data Preprocessing

Text Cleaning & Tokenization: Clean the movie overviews (e.g., lowercasing, removing special characters) and tokenize the text.
Label Preparation: Transform the genre_ids into a multi-label format. Use movies_genres.csv to convert these IDs into genre names.
Data Splitting: Create training, validation, and test sets, ensuring the distribution of genres is well represented.

c. Baseline Models
Encourage participants to start with simple models (e.g., bag-of-words, TF-IDF combined with logistic regression or random forest) and progress towards deep learning approaches like LSTM-based networks or transformer models.
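As a concrete starting point, here is a sketch of a TF-IDF plus one-vs-rest logistic regression baseline with scikit-learn. The toy corpus and genre labels are invented for illustration; real data would come from the movies_overview file.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in corpus and multi-label genre annotations.
overviews = [
    "a detective hunts a killer through the city",
    "two friends fall in love on a road trip",
    "a detective falls in love while solving a case",
    "a spaceship crew battles an alien threat",
]
labels = [["Crime"], ["Romance"], ["Crime", "Romance"], ["Sci-Fi"]]

# One binary indicator column per genre.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(overviews)

# Binary relevance: one independent logistic regression per genre.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

pred = clf.predict(vectorizer.transform(["a killer is on the loose"]))
print(mlb.inverse_transform(pred))
```

The same pipeline shape carries over when swapping in stronger text encoders; only the feature extraction step changes.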
d. Evaluation Metrics
Since this is a multi-label task, consider evaluation metrics such as:
F1 Score (Macro / Micro): Balances precision and recall.
Hamming Loss: Measures how many labels are incorrectly predicted.
Subset Accuracy: For stricter evaluation (all labels must match exactly).

4. Additional Considerations

Baseline Code & Notebooks: Provide a starter notebook with initial data loading, preprocessing, and a simple baseline model. This helps lower the entry barrier for participants who may be new to multi-label NLP tasks.
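The multi-label metrics listed above (F1, Hamming loss, subset accuracy) can all be computed with scikit-learn on binary indicator matrices; the small matrices below are made-up values for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Rows = movies, columns = genres (binary indicator format).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

print("micro F1:", f1_score(y_true, y_pred, average="micro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
# Hamming loss: fraction of individual label slots that are wrong.
print("Hamming loss:", hamming_loss(y_true, y_pred))
# Subset accuracy: a row counts only if every label matches exactly.
print("subset accuracy:", accuracy_score(y_true, y_pred))
```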
Evaluation Server & Leaderboard: Ensure that your Kaggle competition setup allows for automatic evaluation using the selected metrics and that a public leaderboard is available for continuous feedback.
Documentation & Discussion: Include detailed documentation describing the datasets, the task requirements, and the evaluation procedure. Additionally, host a discussion forum to foster collaboration among participants.
Original Data Source: IMDb Movie Genre Classification Dataset