20 datasets found
  1. IMDb Movie Genre Classification Dataset

    • opendatabay.com
    • kaggle.com
    Updated Jun 23, 2025
    Cite
    Datasimple (2025). IMDb Movie Genre Classification Dataset [Dataset]. https://www.opendatabay.com/data/web-social/2e13f07c-9c7c-4856-80c0-1a027f82b3c9
    Explore at:
    Available download formats
    Dataset updated
    Jun 23, 2025
    Dataset authored and provided by
    Datasimple
    Area covered
    Entertainment & Media Consumption
    Description
    1. Overview of the Data

    movies_overview.csv columns:

    • title: the movie title
    • overview: a brief description or synopsis of the movie
    • genre_ids: one or more genre identifiers (a movie may have several, making the labels multi-label)

    movies_genres.csv columns:

    • id: genre identifier
    • name: the corresponding genre name

    This setup allows you to map each movie’s genre_ids from the movies_overview file to its actual genre names using the movies_genres mapping.

    2. Proposed NLP Task: Multi-Label Genre Classification

    Objective: create an NLP model that, given a movie’s overview, predicts the correct genre(s) for that movie. Since a movie may belong to multiple genres, this is a multi-label classification task.

    3. Challenge Breakdown

    a. Problem Statement: participants are tasked with designing and training an NLP model that takes the movie overview text as input and outputs one or more genres. The challenge could encourage approaches spanning from classical text classification methods (e.g., TF-IDF with logistic regression) to modern transformer-based models (e.g., BERT, RoBERTa).

    b. Data Preprocessing:

    • Text Cleaning & Tokenization: clean the movie overviews (e.g., lowercasing, removing special characters) and tokenize the text.
    • Label Preparation: transform the genre_ids into a multi-label format, using movies_genres.csv to convert the IDs into genre names.
    • Data Splitting: create training, validation, and test sets, ensuring the distribution of genres is well represented.

    c. Baseline Models: encourage participants to start with simple models (e.g., bag-of-words or TF-IDF combined with logistic regression or a random forest) and progress towards deep learning approaches such as LSTM-based networks or transformer models.
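    The Label Preparation step can be sketched in a few lines; the toy rows and id→name pairs below merely mirror the dataset description (the real CSVs would be loaded with csv or pandas):

```python
# Sketch of Label Preparation: map genre_ids to names via the
# movies_genres mapping and build multi-hot target vectors.
# Toy rows stand in for the real CSV contents.

genres = {28: "Action", 35: "Comedy", 18: "Drama"}            # id -> name (movies_genres.csv)
movies = [                                                    # rows of movies_overview.csv
    {"title": "A", "overview": "a heist goes wrong", "genre_ids": [28, 35]},
    {"title": "B", "overview": "a quiet family story", "genre_ids": [18]},
]

label_order = sorted(genres)                                  # fixed label column order

def multi_hot(genre_ids):
    """One 0/1 entry per known genre, in label_order."""
    return [1 if g in genre_ids else 0 for g in label_order]

Y = [multi_hot(m["genre_ids"]) for m in movies]               # training targets
names = [[genres[g] for g in m["genre_ids"]] for m in movies] # readable labels
```

    In practice the same multi-hot encoding is what scikit-learn's MultiLabelBinarizer produces, so participants can swap this sketch out for the library call.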

    d. Evaluation Metrics: since this is a multi-label task, consider metrics such as:

    • F1 Score (Macro / Micro): balances precision and recall.
    • Hamming Loss: measures the fraction of labels that are incorrectly predicted.
    • Subset Accuracy: a stricter criterion under which all labels must match exactly.

    4. Additional Considerations

    Baseline Code & Notebooks: provide a starter notebook with initial data loading, preprocessing, and a simple baseline model. This helps lower the entry barrier for participants who may be new to multi-label NLP tasks.
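    For illustration, all three metrics can be computed by hand on multi-hot vectors; in a real competition you would more likely use scikit-learn's f1_score, hamming_loss, and accuracy_score (which, on multi-label data, is subset accuracy). A minimal pure-Python sketch:

```python
# Minimal implementations of the three multi-label metrics over
# multi-hot label vectors (lists of 0/1 per genre).
def f1_micro(y_true, y_pred):
    pairs = [(t, p) for T, P in zip(y_true, y_pred) for t, p in zip(T, P)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn)

def hamming_loss(y_true, y_pred):
    pairs = [(t, p) for T, P in zip(y_true, y_pred) for t, p in zip(T, P)]
    return sum(1 for t, p in pairs if t != p) / len(pairs)

def subset_accuracy(y_true, y_pred):
    return sum(1 for T, P in zip(y_true, y_pred) if T == P) / len(y_true)

y_true = [[1, 0, 1], [0, 1, 0]]   # two movies, three genres
y_pred = [[1, 0, 0], [0, 1, 0]]   # one missed genre on the first movie
```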

    Evaluation Server & Leaderboard: Ensure that your Kaggle competition setup allows for automatic evaluation using the selected metrics and that a public leaderboard is available for continuous feedback.

    Documentation & Discussion: Include detailed documentation describing the datasets, the task requirements, and the evaluation procedure. Additionally, host a discussion forum to foster collaboration among participants.

    5. Final Remarks

    This challenge not only tests participants’ ability to handle multi-label classification and text processing but also encourages them to explore advanced NLP techniques and model evaluation strategies. The combination of movie overviews and genre mapping offers a rich and interesting dataset for an engaging Kaggle competition.

    Original Data Source: IMDb Movie Genre Classification Dataset

  2. Movie & Rating Data - 2025

    • kaggle.com
    Updated Feb 23, 2025
    Cite
    Ayberk URAL (2025). Movie & Rating Data - 2025 [Dataset]. https://www.kaggle.com/datasets/ayberkural/movielens-movie-csv-and-rating-csv/versions/1
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 23, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ayberk URAL
    Description

    🎬 Movie Ratings & Metadata Dataset

    This dataset provides a structured collection of movie ratings along with essential metadata, making it a valuable resource for data analysis and machine learning projects. It consists of two main tables:

    📊 Ratings Data

    This table contains user-generated ratings for various movies.

    • userId → unique identifier for each user.
    • movieId → unique identifier for each movie (linked to the movie metadata table).
    • rating → rating given by the user (typically on a scale of 0–5).
    • timestamp → time when the rating was given (Unix format).

    🎞️ Movie Metadata

    This table provides details about the movies being rated.

    • movieId → unique identifier for each movie (linked to the ratings table).
    • title → movie title, including the release year.
    • genres → list of genres associated with the movie (e.g., Action, Comedy, Drama).

    🎯 Use Cases:
    ✔️ Recommender Systems – build personalized movie recommendation models.
    ✔️ Trend Analysis – explore how audience preferences change over time.
    ✔️ Sentiment & Popularity Analysis – compare ratings across different genres.
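    The movieId link between the two tables is a plain join; a minimal sketch with toy rows standing in for the real CSVs (the pipe-separated genres string is an assumption based on the MovieLens convention this dataset follows):

```python
# Sketch: join ratings to movie metadata on movieId and average
# ratings per title. Toy rows stand in for rating.csv / movie.csv;
# the real files load the same way with csv.DictReader or pandas.
from collections import defaultdict

movies = {1: ("Toy Story (1995)", "Animation|Comedy"),   # movieId -> (title, genres)
          2: ("Heat (1995)", "Action|Crime")}
ratings = [                                              # (userId, movieId, rating, timestamp)
    (10, 1, 4.0, 964982703),
    (11, 1, 5.0, 964982931),
    (10, 2, 3.0, 964982400),
]

by_movie = defaultdict(list)
for _, movie_id, rating, _ in ratings:
    by_movie[movie_id].append(rating)

# average rating per title, via the movieId link
avg = {movies[m][0]: sum(rs) / len(rs) for m, rs in by_movie.items()}
```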

    This dataset is clean, structured, and ready for exploration. 🚀

    👉 Start analyzing and uncover interesting movie insights! 🎥

  3. Data from: imdb

    • huggingface.co
    Updated Aug 3, 2003
    + more versions
    Cite
    Stanford NLP (2003). imdb [Dataset]. https://huggingface.co/datasets/stanfordnlp/imdb
    Explore at:
    Croissant
    Dataset updated
    Aug 3, 2003
    Dataset authored and provided by
    Stanford NLP
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for "imdb"

      Dataset Summary
    

    Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

      Supported Tasks and Leaderboards
    

    More Information Needed

      Languages
    

    More Information Needed

      Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
    
  4. IMDb India Movies

    • kaggle.com
    Updated Jun 18, 2021
    Cite
    Adrian McMahon (2021). IMDb India Movies [Dataset]. https://www.kaggle.com/adrianmcmahon/imdb-india-movies/discussion
    Explore at:
    Croissant
    Dataset updated
    Jun 18, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Adrian McMahon
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    India
    Description

    Task Details

    Every dataset has a story, and this one was pulled from IMDb.com: all the Indian movies on the platform. Clean the data by removing rows with missing values or imputing averages; this preparation will make the data easier to manipulate for your EDA.

    Analyze data and provide some trends.

    • Which year had the best average rating?
    • Does the length of a movie have any impact on its rating?
    • Top 10 movies according to rating, per year and overall.
    • Number of popular movies released each year.
    • Counting votes, which movies performed better in rating, per year and overall?
    • Which director directed the most movies?
    • Which actor starred in the most movies?
    • Any other trends or future predictions you may find.
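    Two of the suggested questions can be sketched directly; the toy rows and column names (Name, Year, Duration, Rating) are assumptions, since the real CSV's headers may differ:

```python
# Sketch of two EDA questions on toy rows: the year with the best
# average rating, and the top-N movies by rating overall.
films = [
    {"Name": "A", "Year": 2019, "Duration": 150, "Rating": 8.1},
    {"Name": "B", "Year": 2019, "Duration": 120, "Rating": 6.4},
    {"Name": "C", "Year": 2020, "Duration": 140, "Rating": 7.9},
]

def avg_rating(year):
    rs = [f["Rating"] for f in films if f["Year"] == year]
    return sum(rs) / len(rs)

# year with the best average rating
best_year = max({f["Year"] for f in films}, key=avg_rating)

# top-10 movies by rating, overall
top = sorted(films, key=lambda f: f["Rating"], reverse=True)[:10]
```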

    Thank you for viewing my dataset; looking forward to seeing some code.

  5. Using social network information to discover truth of movie ranking

    • researchdata.ntu.edu.sg
    tsv, txt
    Updated Jun 10, 2018
    Cite
    DR-NTU (Data) (2018). Using social network information to discover truth of movie ranking [Dataset]. http://doi.org/10.21979/N9/L5TTRW
    Explore at:
    tsv(4143), tsv(26553), txt(1857). Available download formats
    Dataset updated
    Jun 10, 2018
    Dataset provided by
    DR-NTU (Data)
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The real dataset consists of movie evaluations from IMDb, which provides a platform where individuals can evaluate movies on a scale of 1 to 10. If a user rates a movie and clicks the share button, a Twitter message is generated, and we extract the rating from that message. We treat the ratings on the IMDb website as the event truths, since they are based on the aggregated evaluations from all users, whereas our observations come from only the subset of users who share their ratings on Twitter. Using the Twitter API, we collect information about the follower and following relationships between the individuals who generate movie evaluation Twitter messages. To better show the influence of social network information on event truth discovery, we delete small subnetworks that consist of fewer than 5 agents. The final dataset consists of 2266 evaluations from 209 individuals on 245 movies (events), together with the social network between these 209 individuals. We regard the social network as undirected, as both follower and following relationships indicate that the two users have similar taste.
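    The subnetwork-filtering step described above (dropping components with fewer than 5 agents from the undirected graph) amounts to a plain connected-components pass; a minimal sketch:

```python
# Sketch: build the undirected rating-sharers graph and keep only
# connected components with at least min_size agents.
def components(nodes, edges):
    adj = {n: set() for n in nodes}
    for a, b in edges:            # follower and following both become one undirected edge
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:              # iterative DFS over one component
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def drop_small(nodes, edges, min_size=5):
    return [c for c in components(nodes, edges) if len(c) >= min_size]

# a 5-agent chain survives; an isolated pair is dropped
kept = drop_small(range(1, 8), [(1, 2), (2, 3), (3, 4), (4, 5), (6, 7)])
```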

  6. LifeQA dataset features

    • deepblue.lib.umich.edu
    Updated Mar 7, 2025
    Cite
    Castro, Santiago; Azab, Mahmoud; Stroud, Jonathan C.; Noujaim, Cristina; Wang, Ruoyao; Deng, Jia; Mihalcea, Rada (2025). LifeQA dataset features [Dataset]. http://doi.org/10.7302/nbj0-np80
    Explore at:
    Dataset updated
    Mar 7, 2025
    Dataset provided by
    Deep Blue Data
    Authors
    Castro, Santiago; Azab, Mahmoud; Stroud, Jonathan C.; Noujaim, Cristina; Wang, Ruoyao; Deng, Jia; Mihalcea, Rada
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We introduce LifeQA, a benchmark dataset for video question answering focusing on daily real-life situations. Current video question-answering datasets consist of movies and TV shows. However, it is well-known that these visual domains do not represent our day-to-day lives. Movies and TV shows, for example, benefit from professional camera movements, clean editing, crisp audio recordings, and scripted dialog between professional actors. While these domains provide a large amount of data for training models, their properties make them unsuitable for testing real-life question-answering systems. Our dataset, by contrast, consists of video clips that represent only real-life scenarios. We collect 275 such video clips and over 2.3k multiple-choice questions. In this paper, we analyze the challenging but realistic aspects of LifeQA and apply several state-of-the-art video question-answering models to provide benchmarks for future research.

    For more information, refer to https://lit.eecs.umich.edu/lifeqa/.

  7. Top 6000 TMDB Italian Movies

    • kaggle.com
    Updated May 9, 2023
    Cite
    Mateusz Miroslaw Lis (2023). Top 6000 TMDB Italian Movies [Dataset]. https://www.kaggle.com/datasets/mateuszmiroslawlis/top6000-tmdb-italian-films
    Explore at:
    Croissant
    Dataset updated
    May 9, 2023
    Dataset provided by
    Kaggle
    Authors
    Mateusz Miroslaw Lis
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Introduction

    This dataset provides information on the top 6000 (actually 6118) most popular Italian movies available on TMDB as of May 2023. The data was collected with a Python script through the TMDB API; entries were then filtered to remove any film missing "overview", "release_date", or "vote_average" data. Some details such as "genre", "duration", etc. were omitted, but they can easily be retrieved through the TMDB API using the movie id from the "id" column.

    Contents

    • id: Contains the TMDB's id for the movie;
    • title: Contains the title of the movie;
    • overview: Contains a description of the movie, usually a logline;
    • release_date: Contains the release date of the movie;
    • vote_average: Contains the arithmetic average of the users' votes.
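    The omitted fields can be re-fetched from TMDB's v3 API via its /movie/{movie_id} endpoint; a sketch (the API key is a placeholder, and the actual request is left commented so nothing hits the network here):

```python
# Sketch: build the TMDB v3 movie-details request for a row's id.
# API_KEY is a placeholder you must supply; the endpoint shape follows
# the TMDB v3 API documentation.
def movie_details_url(movie_id, api_key, language="it-IT"):
    return (f"https://api.themoviedb.org/3/movie/{movie_id}"
            f"?api_key={api_key}&language={language}")

# e.g. requests.get(movie_details_url(row_id, API_KEY)).json() would
# return a dict with "genres", "runtime", and the other omitted fields.
```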
  8. Data from: EmoKey Moments Muse EEG Dataset (EKM-ED): A Comprehensive...

    • produccioncientifica.ugr.es
    • data.niaid.nih.gov
    • +1more
    Updated 2020
    Cite
    Francisco M. Garcia-Moreno; Marta Badenes-Sastre; Francisco M. Garcia-Moreno; Marta Badenes-Sastre (2020). EmoKey Moments Muse EEG Dataset (EKM-ED): A Comprehensive Collection of Muse S EEG Data and Key Emotional Moments [Dataset]. https://produccioncientifica.ugr.es/documentos/668fc432b9e7c03b01bd6125
    Explore at:
    Dataset updated
    2020
    Authors
    Francisco M. Garcia-Moreno; Marta Badenes-Sastre; Francisco M. Garcia-Moreno; Marta Badenes-Sastre
    Description

    EmoKey Moments Muse EEG Dataset (EKM-ED): A Comprehensive Collection of Muse S EEG Data and Key Emotional Moments

    Dataset Description:

    The EmoKey Moments EEG Dataset (EKM-ED) is an intricately curated dataset amassed from 47 participants, detailing EEG responses as they engage with emotion-eliciting video clips. Covering a spectrum of emotions, this dataset holds immense value for those diving deep into human cognitive responses, psychological research, and emotion-based analyses.

    Dataset Highlights:

    Precise Timestamps: Capturing the exact millisecond of EEG data acquisition, ensuring unparalleled granularity.

    Brainwave Metrics: Illuminating the variety of cognitive states through the prism of Delta, Theta, Alpha, Beta, and Gamma waves.

    Motion Data: Encompassing the device's movement in three dimensions for enhanced contextuality.

    Auxiliary Indicators: Key elements like the device's positioning, battery metrics, and user-specific actions are meticulously logged.

    Consent and Ethics: The dataset respects and upholds privacy and ethical standards. Every participant provided informed consent. This endeavor has received the green light from the Ethics Committee at the University of Granada, documented under the reference: 2100/CEIH/2021.

    A pivotal component of this dataset is its focus on "key moments" within the selected video clips, honing in on periods anticipated to evoke heightened emotional responses.

    Curated Video Clips within Dataset:

    Film                  Emotion    Duration (seconds)
    The Lover             Baseline    43
    American History X    Anger      106
    Cry Freedom           Sadness    166
    Alive                 Happiness  310
    Scream                Fear       395

    The cornerstone of EKM-ED is its innovative emphasis on these key moments, bringing to light the correlation between distinct cinematic events and specific EEG responses.

    Key Emotional Moments in Dataset:

    Film                  Emotion    Key moment timestamps (seconds)
    American History X    Anger      36, 57, 68
    Cry Freedom           Sadness    112, 132, 154
    Alive                 Happiness  227, 270, 289
    Scream                Fear       23, 42, 79, 226, 279, 299, 334

    Citation: Gilman, T. L., et al. (2017). A film set for the elicitation of emotion in research. Behavior Research Methods, 49(6). Link to the study

    With its unparalleled depth and focus, the EmoKey Moments EEG Dataset aims to advance research in fields such as neuroscience, psychology, and affective computing, providing a comprehensive platform for understanding and analyzing human emotions through EEG data.
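    As a sketch of how the key moments might be used, the timestamps from the tables above can anchor fixed-width windows over timestamped samples; the (seconds, value) sample format here is an assumption, since the real CSVs carry per-channel columns and millisecond timestamps:

```python
# Sketch: slice timestamped EEG samples into windows around the key
# moments of a clip. Sample format (seconds, reading) is a toy stand-in.
def key_moment_windows(samples, moments, half_width=2.0):
    """One list of (t, value) samples per key moment, within ±half_width seconds."""
    return [[s for s in samples if abs(s[0] - t) <= half_width]
            for t in moments]

fear_moments = [23, 42, 79, 226, 279, 299, 334]       # from the Scream row
samples = [(t / 128, 0.0) for t in range(128 * 30)]   # 30 s of 128 Hz dummy data
windows = key_moment_windows(samples, fear_moments)
```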

    ——————————————————————————————————— FOLDER STRUCTURE DESCRIPTION ———————————————————————————————————

    • questionnaires: all the response questionnaires (Spanish), raw and preprocessed, including SAM
      └── preprocessed
          └── Ficha_Evaluacion_Participante_SAM_Refactored.csv: the SAM responses for every film clip

    • key_moments: the key moment timestamps for every emotion’s clip

    • muse_wearable_data: XXXX
      ├── raw
      │   └── 1: subject ID = 1
      │       ├── muse: EEG data of the Muse device
      │       │   ├── ANGER_XXX.csv: EEG data of the anger elicitation
      │       │   ├── FEAR_XXX.csv: EEG data of the fear elicitation
      │       │   ├── HAPPINESS_XXX.csv: EEG data of the happiness elicitation
      │       │   └── SADNESS_XXX.csv: EEG data of the sadness elicitation
      │       └── order: film elicitation play order, for example: HAPPINESS,SADNESS,ANGER,FEAR
      └── preprocessed
          ├── unclean-signals: EEG artifacts, noise, etc. not removed
          │   └── muse: EEG data of the Muse device
          │       └── 0.0078125: data downsampled to 128 Hz from the recorded 256 Hz
          └── clean-signals: EEG artifacts, noise, etc. removed
              └── muse: EEG data of the Muse device
                  └── 0.0078125: data downsampled to 128 Hz from the recorded 256 Hz

    The ethical consent for this dataset was provided by La Comisión de Ética en Investigación de la Universidad de Granada, as documented in the approval titled: 'DETECCIÓN AUTOMÁTICA DE LAS EMOCIONES BÁSICAS Y SU INFLUENCIA EN LA TOMA DE DECISIONES MEDIANTE WEARABLES Y MACHINE LEARNING' registered under 2100/CEIH/2021.

  9. WIRED ICM Sample Dataset - Workshop on Intracranial Recordings in Humans,...

    • openneuro.org
    Updated Mar 1, 2024
    Cite
    Liberty S. Hamilton; Maansi Desai; Alyssa Field (2024). WIRED ICM Sample Dataset - Workshop on Intracranial Recordings in Humans, Epilepsy, DBS [Dataset]. http://doi.org/10.18112/openneuro.ds004993.v1.1.2
    Explore at:
    Dataset updated
    Mar 1, 2024
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Liberty S. Hamilton; Maansi Desai; Alyssa Field
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    WIRED ICM TUTORIAL DATA

    Contributors: Liberty S. Hamilton, PhD, Maansi Desai, PhD, Alyssa Field, MEd

    Email: liberty.hamilton@austin.utexas.edu

    This is a sample BIDS dataset for the WIRED ICM course in Paris, France in March 2024.

    This contains intracranial recordings collected by the Hamilton Lab at the University of Texas at Austin. These recordings include examples of evoked data during natural listening tasks along with some examples of seizure-related activity and vagus nerve stimulator (VNS) artifact for illustrative purposes. All procedures were approved by the University of Texas at Austin Institutional Review Board.

    Funding: Support was provided by the National Institutes of Health National Institute on Deafness and Other Communication Disorders (R01 DC018579, to LSH).

    Tasks:

    1. movietrailers - this task involves patients listening to movie clips from various Pixar, Disney, Dreamworks, and other movies. We have previously published using these stimuli in EEG (Desai et al. 2021).
    2. timit4 and timit5 - these tasks involve patients listening to subsets of the TIMIT acoustic-phonetic corpus (Garofolo et al. 1993). The events provided in the dataset mark the onset and offset of each sentence. In timit4, each sentence is unique, while in timit5, 10 sentences are repeated 10 times. This is the same stimulus set used in Mesgarani et al. 2014, Hamilton et al. 2018, Hamilton et al. 2021, and Desai et al. 2021.

    Notes:

    • The movie trailer data for subject W1 was acquired at the start of a generalized tonic clonic seizure, and the research session was terminated. Large, synchronized spikes can be observed on multiple channels on the right parietal grid throughout the iEEG data.
    • The TIMIT data for subject W2 is an example of fairly clean sentence evoked data.
    • The TIMIT data for subject W3 is a good example of on-and-off VNS artifact. The VNS has a strong artifact at ~20 Hz. Some patients with epilepsy may have these implanted devices to help control their seizures, so you should know how to spot artifact-related activity. Despite these artifacts, the evoked responses to sentences are quite strong.
    • The acquisition number (B3, B8, etc) has to do with the order in which this task was run relative to other tasks in an iEEG session, and can be ignored here.

    References

    • Appelhoff, S., Sanderson, M., Brooks, T., Vliet, M., Quentin, R., Holdgraf, C., Chaumon, M., Mikulan, E., Tavabi, K., Höchenberger, R., Welke, D., Brunner, C., Rockhill, A., Larson, E., Gramfort, A. and Jas, M. (2019). MNE-BIDS: Organizing electrophysiological data into the BIDS format and facilitating their analysis. Journal of Open Source Software 4: (1896). https://doi.org/10.21105/joss.01896
    • Desai, M., Holder, J., Villarreal, C., Clark, N., Hoang, B., & Hamilton, L. S. (2021). Generalizable EEG encoding models with naturalistic audiovisual stimuli. Journal of Neuroscience, 41(43), 8946-8962.
    • Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., & Pallett, D. S. (1993). DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon technical report n, 93, 27403.
    • Hamilton, L. S., Edwards, E., & Chang, E. F. (2018). A spatial map of onset and sustained responses to speech in the human superior temporal gyrus. Current Biology, 28(12), 1860-1871.
    • Hamilton, L. S., Oganian, Y., Hall, J., & Chang, E. F. (2021). Parallel and distributed encoding of speech across human auditory cortex. Cell, 184(18), 4626-4639.
    • Holdgraf, C., Appelhoff, S., Bickel, S., Bouchard, K., D'Ambrosio, S., David, O., … Hermes, D. (2019). iEEG-BIDS, extending the Brain Imaging Data Structure specification to human intracranial electrophysiology. Scientific Data, 6, 102. https://doi.org/10.1038/s41597-019-0105-7
    • Mesgarani, N., Cheung, C., Johnson, K., & Chang, E. F. (2014). Phonetic feature encoding in human superior temporal gyrus. Science, 343(6174), 1006-1010.
  10. ForrestGump-MEG

    • openneuro.org
    Updated May 17, 2021
    + more versions
    Cite
    Xingyu Liu; Yuxuan Dai; Hailun Xie; Zonglei Zhen (2021). ForrestGump-MEG [Dataset]. http://doi.org/10.18112/openneuro.ds003633.v1.0.1
    Explore at:
    Dataset updated
    May 17, 2021
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Xingyu Liu; Yuxuan Dai; Hailun Xie; Zonglei Zhen
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    ForrestGump-MEG: An audio-visual movie watching MEG dataset

    For details please refer to our paper on [].

    This dataset contains MEG data recorded from 11 subjects while watching the 2h long Chinese-dubbed audio-visual movie 'Forrest Gump'. The data were acquired with a 275-channel CTF MEG. Auxiliary data (T1w) as well as derivation data such as preprocessed data and MEG-MRI co-registration are also included.

    Pre-process procedure description

    The T1w images stored as NIFTI files were minimally-preprocessed using the anatomical preprocessing pipeline from fMRIPrep with default settings.

    MEG data were pre-processed using MNE following a three-step procedure:

    1. bad channels were detected and removed;
    2. a 1 Hz high-pass filter was applied to remove possible slow drifts from the continuous MEG data;
    3. artifact removal was performed with ICA.

    Stimulus material

    The audio-visual stimulus material was from the Chinese-dubbed 'Forrest Gump' DVD released in 2013 (ISBN: 978-7-7991-3934-0), which cannot be publicly released here due to copyright restrictions.

    Dataset content overview

    The data were organized following MEG-BIDS using the MNE-BIDS toolbox.

    the pre-processed MEG data

    The preprocessed MEG recordings, including the preprocessed MEG data, the event files, the ICA decomposition and label files, and the MEG-MRI coordinate transformation file, are hosted here.

    └── ./derivatives/preproc_meg-mne_mri-fmriprep/sub-xx/ses-movie/meg/
      ├── sub-xx_ses-movie_coordsystem.json
      ├── sub-xx_ses-movie_task-movie_run-xx_channels.tsv
      ├── sub-xx_ses-movie_task-movie_run-xx_decomposition.tsv
      ├── sub-xx_ses-movie_task-movie_run-xx_events.tsv
      ├── sub-xx_ses-movie_task-movie_run-xx_ica.fif.gz
      ├── sub-xx_ses-movie_task-movie_run-xx_meg.fif
      ├── sub-xx_ses-movie_task-movie_run-xx_meg.json
      ├── ...
      └── sub-xx_ses-movie_task-movie_trans.fif

    the pre-processed MRI data

    The preprocessed MRI volume, reconstructed surface, and other associations including transformation files are hosted here

    └── ./derivatives/preproc_meg-mne_mri-fmriprep/sub-xx/ses-movie/anat/
      ├── sub-xx_ses-movie_desc-preproc_T1w.nii.gz
      ├── sub-xx_ses-movie_hemi-L_inflated.surf.gii
      ├── sub-xx_ses-movie_hemi-L_midthickness.surf.gii
      ├── sub-xx_ses-movie_hemi-L_pial.surf.gii
      ├── sub-xx_ses-movie_hemi-L_smoothwm.surf.gii
      ├── sub-xx_ses-movie_hemi-R_inflated.surf.gii
      ├── sub-xx_ses-movie_hemi-R_midthickness.surf.gii
      ├── sub-xx_ses-movie_hemi-R_pial.surf.gii
      ├── sub-xx_ses-movie_hemi-R_smoothwm.surf.gii
      ├── sub-xx_ses-movie_space-MNI152NLin2009cAsym_desc-preproc_T1w.nii.gz
      ├── sub-xx_ses-movie_space-MNI152NLin6Asym_desc-preproc_T1w.nii.gz
      └── ...

    the FreeSurfer surface data, the high-resolution head surface and the MRI-fiducials are provided here

    └── ./derivatives/preproc_meg-mne_mri-fmriprep/sourcedata/
      └── freesurfer
        ├── sub-xx
        └── ...

    the raw data

    └── ./sub-xx/ses-movie/
      ├── meg/
      │   ├── sub-xx_ses-movie_coordsystem.json
      │   ├── sub-xx_ses-movie_task-movie_run-xx_channels.tsv
      │   ├── sub-xx_ses-movie_task-movie_run-xx_events.tsv
      │   ├── sub-xx_ses-movie_task-movie_run-xx_meg.ds
      │   ├── sub-xx_ses-movie_task-movie_run-xx_meg.json
      │   └── ...
      └── anat/
          ├── sub-xx_ses-movie_T1w.json
          └── sub-xx_ses-movie_T1w.nii.gz
  11. sst2

    • huggingface.co
    Updated May 8, 2023
    Cite
    Stanford NLP (2023). sst2 [Dataset]. https://huggingface.co/datasets/stanfordnlp/sst2
    Explore at:
    Croissant
    Dataset updated
    May 8, 2023
    Dataset authored and provided by
    Stanford NLP
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Dataset Card for [Dataset Name]

      Dataset Summary
    

    The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/sst2.

  12. Data from: Individual differences in neural event segmentation of continuous...

    • openneuro.org
    Updated Jan 25, 2024
    Cite
    Clara Sava-Segal; Chandler Richards; Megan Leung; Emily S. Finn (2024). Individual differences in neural event segmentation of continuous experiences [Dataset]. http://doi.org/10.18112/openneuro.ds004516.v1.0.2
    Explore at:
    Dataset updated
    Jan 25, 2024
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Clara Sava-Segal; Chandler Richards; Megan Leung; Emily S. Finn
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Four stimuli were used:

    (Note that the versions presented to subjects were edited to remove credits and title pages; these edited versions are available upon request.)

    Stimuli Description

    • Iteration (https://youtu.be/c53fGdK84rc; 12:27 min:s) is a sci-fi movie that follows a female character as she goes through multiple iterations of waking up and trying to escape a facility. A male character appears toward the end to help her.
    • Defeat (https://youtu.be/6yN9VH_4GSQ; 7:57 min:s) follows a family of three (mother, two children) as the brother bullies his sister and she builds a time machine to go back and get revenge.
    • Growth (https://youtu.be/JyvFXBA3O8o; 8:27 min:s) follows a family of four (mother, father, two brothers) as the children grow up and eventually move out amid some family conflict.
    • Lemonade (https://youtu.be/Av07QiqmsoA; 7:27 min:s) is a Rube Goldberg machine consisting of a series of objects that move throughout a house, ending in the pouring of a cup of lemonade. This movie was lightly edited to remove fleeting shots of human characters.

    Iteration and Defeat both contained screen cuts (continuity editing), whereas Growth and Lemonade were shot continuously, with the camera panning smoothly from one scene to the next.

    Runs are a bit longer than the movie stimuli themselves. We dropped the first 2 TRs and the last 12 TRs for each functional run. Please reach out if you have any questions.

    For Growth, this corresponds to TRs: 2:505
    For Lemonade, this corresponds to TRs: 2:449
    For Defeat, this corresponds to TRs: 2:480
    For Iteration, this corresponds to TRs: 2:748
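    The trimming rule above is a simple slice along the time axis; a sketch on a toy run (the real runs are 4-D volumes, and the TR ranges listed use the authors' own numbering):

```python
# Sketch: drop the first 2 and the last 12 TRs of a functional run,
# as described above, via Python slicing along the time dimension.
def trim_run(run_trs, head=2, tail=12):
    return run_trs[head:-tail] if tail else run_trs[head:]

trimmed = trim_run(list(range(100)))   # a toy 100-TR run
```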

  13. MPI Sintel Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated May 13, 2021
    Cite
    Daniel J. Butler; Jonas Wulff; Garrett B. Stanley; Michael J. Black (2021). MPI Sintel Dataset [Dataset]. https://paperswithcode.com/dataset/mpi-sintel
    Explore at:
    Dataset updated
    May 13, 2021
    Authors
    Daniel J. Butler; Jonas Wulff; Garrett B. Stanley; Michael J. Black
    Description

    MPI (Max Planck Institute) Sintel is a dataset for optical flow evaluation that contains 1064 synthesized stereo images with ground-truth disparity data. It is derived from Sintel, an open-source 3D animated short film, and comprises 23 different scenes. The stereo images are RGB while the disparity maps are grayscale; both have a resolution of 1024×436 pixels at 8 bits per channel.
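    Sintel's optical-flow ground truth is distributed in the Middlebury `.flo` binary format (a float32 magic number 202021.25, then int32 width and height, then interleaved u/v floats); a minimal reader sketch:

```python
import numpy as np

def read_flo(path):
    """Read a Middlebury .flo optical-flow file into an (H, W, 2)
    float32 array of (u, v) flow vectors."""
    with open(path, "rb") as f:
        magic = np.fromfile(f, np.float32, count=1)[0]
        if magic != 202021.25:
            raise ValueError("not a valid .flo file")
        w = int(np.fromfile(f, np.int32, count=1)[0])
        h = int(np.fromfile(f, np.int32, count=1)[0])
        flow = np.fromfile(f, np.float32, count=2 * w * h)
    return flow.reshape(h, w, 2)
```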

  14. Data from: Template-stripped substrates with solvent-impermeable metal thin...

    • figshare.com
    • acs.figshare.com
    zip
    Updated May 21, 2025
    Cite
    Cynthia Avedian; Christina D. M. Trang; Michael S. Inkpen (2025). Template-stripped substrates with solvent-impermeable metal thin films [Dataset]. http://doi.org/10.1021/acsnanoscienceau.5c00018.s001
    Explore at:
    Available download formats: zip
    Dataset updated
    May 21, 2025
    Dataset provided by
    ACS Publications
    Authors
    Cynthia Avedian; Christina D. M. Trang; Michael S. Inkpen
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Template-stripped substrates provide on-demand access to clean, ultraflat gold surfaces, avoiding the need for laborious cleaning procedures or the use of expensive single-crystal electrodes. While these gold/adhesion layer/support sandwich structures are most conveniently prepared through the application of epoxy or optical adhesives, such composites exhibit instabilities in organic solvents that limit their wider application. Here we demonstrate that substrates with solvent-impermeable metal films can be used in previously problematic chemical environments after integration into a protective, custom-built (electrochemical) flow cell. We apply our methodology to probe different self-assembled monolayers, observing reproducible alkanethiol reductive desorption features and an exemplary redox response using 6-(ferrocenyl)hexanethiol, and corroborating findings that cobalt(II) bis(terpyridine) assemblies exhibit a low coverage. This work significantly extends the utility of these substrates, relative to mechanically polished or freshly deposited alternatives, particularly for studies of systems involving adsorbed molecules whose properties are strongly influenced by the nanoscopic features of the metal-solution interface.

  15. OpenSubtitles Dataset

    • paperswithcode.com
    Updated Jul 10, 2022
    + more versions
    Cite
    Pierre Lison; Jörg Tiedemann (2022). OpenSubtitles Dataset [Dataset]. https://paperswithcode.com/dataset/opensubtitles
    Explore at:
    Dataset updated
    Jul 10, 2022
    Authors
    Pierre Lison; Jörg Tiedemann
    Description

    OpenSubtitles is a collection of multilingual parallel corpora compiled from a large database of movie and TV subtitles. It includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages.

  16. 350 000+ movies from themoviedb.org

    • kaggle.com
    zip
    Updated Oct 12, 2017
    Cite
    Stephanerappeneau (2017). 350 000+ movies from themoviedb.org [Dataset]. https://www.kaggle.com/stephanerappeneau/350-000-movies-from-themoviedborg
    Explore at:
    Available download formats: zip (70,483,259 bytes)
    Dataset updated
    Oct 12, 2017
    Authors
    Stephanerappeneau
    Description

    Context

    I love movies.

    I tend to avoid Marvel-Transformers-standardized products, preferring a mix of classic Hollywood golden age and obscure Polish art-house movies. Throw in an occasional Japanese-zombie-slasher-giallo as an alibi. Good movies don't exist without bad movies.

    On average I watch 200+ movies each year, with peaks of more than 500. Nine years ago I started logging my movies to avoid watching the same one twice, and also to assign scores. Over the years this gave me a couple of insights into my viewing habits, but nothing more than what a tenth-grader would learn at school.

    I've recently subscribed to Netflix, and it pains me to see how inefficient recommendation systems are for people like me, who mostly swear by "la politique des auteurs". The term, coined by the famous French New Wave critics of Cahiers du cinéma, means that the quality of a movie is essentially linked to the director and his capacity to execute his vision with his crew. We could debate whether that depends on the movie production pipeline, but let's not for now. Practically, what it means is that I essentially watch movies from directors who made films I've liked.

    I suspect Netflix calibrates its recommendation models around the way the average Joe chooses a movie. A few months ago I read a survey-based study showing that people choose a movie mostly by genre (55%), then by leading actors (45%); director and release date were far behind, at around 10% each. That is not surprising, since most people I know don't care who the director is; lots of US blockbusters don't even mention it on the movie poster. I am aware that collaborative filtering is based on user proximity, which I believe decreases (or even eliminates) the need to characterize a movie. So here I'm more interested in content-based filtering, which is based on product proximity, for several reasons:

    • Users' tastes are not easily accessible. They are, after all, Netflix's treasure chest

    • The movie selection on Netflix is so poor for someone who likes auteur films that it wouldn't help

    • Modeling a movie's intrinsic qualities is a nice challenge

    Enough.

    "*The secret of getting ahead is getting started*" (Mark Twain)

    Figure: network graph (https://img11.hostingpics.net/pics/117765networkgraph.png)

    Content

    The primary source is www.themoviedb.org. If you watch obscure, artsy Romanian homemade movies you may find only 95% of your movies referenced...but for anyone else it should be in the 98%+ range.

    Here is an overview of the available sources I've tried:

    • Imdb.com free csv dumps (ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/temporaryaccess/) are badly documented, incomplete, loosely structured, and impossible to join/merge. There's an API hosted by Amazon Web Services at 1€ per 100,000 requests; with around 1 million movies it could become expensive, and the features are bare. So I searched for other sources.

    • www.themoviedb.org is based on crowdsourcing and has an excellent API, limited to 40 requests every 10 seconds. That is quite generous, well documented, and enough to sweep the 450,000 movies in a few days. For my purpose, data quality is not significantly worse than IMDb's, and since the IMDb key is also included there's always the possibility of completing my dataset later (I actually did).

    • www.boxofficemojo.com has some interesting budget/revenue figures (which are sorely lacking in both IMDb and TMDb), but it actually tracks only a few thousand movies, mainly blockbusters. There are other professional sources used by the film industry for better predictive/marketing insights, but those are beyond my reach for this experiment.

    • www.wikipedia.com is an interesting source with no real cap on API calls, but it requires a bit of web scraping, and for movies and directors the layout and quality vary a lot. I suspected it would take a lot of work to get insights, so I put this source at lower priority.

    • www.google.com will ban you after a few minutes of web scraping, because their job is to scrape data from others, then sell it, duh.

    • It's worth mentioning that there are a few dumps of anonymized Netflix user tastes on Kaggle, because they've organised a few competitions to improve their recommendation models. https://www.kaggle.com/netflix-inc/netflix-prize-data

    • Online databases are largely Anglo-Saxon-centric, meaning that Bollywood (India is one of the world's biggest producers of movies) is mostly absent from datasets. I'm fine with that, as it's not my cup of tea, plus I lack domain knowledge. The sheer number of Indian movies would probably skew my results anyway (I don't want too many martial-arts musicals in my recommendations ;-)). I have, however, tremendous respect for the Indian movie industry, so I'd love to collaborate with an Indian cinephile!

    Figure: westerns (https://img11.hostingpics.net/pics/340226westerns.png)
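    To sweep the TMDb API without tripping its 40-requests-per-10-seconds limit, a simple sliding-window throttle is enough; a sketch (the endpoint loop and API key are illustrative placeholders, not part of the original write-up):

```python
import time

class RateLimiter:
    """Sliding-window throttle: at most `max_calls` calls per `period` seconds."""
    def __init__(self, max_calls=40, period=10.0):
        self.max_calls, self.period = max_calls, period
        self.calls = []  # timestamps of recent calls

    def wait(self):
        now = time.monotonic()
        # Drop timestamps older than the window
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call in the window expires
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())

# Usage sketch (hypothetical key/endpoint; requires the `requests` package):
# limiter = RateLimiter()
# for movie_id in range(1, 450_000):
#     limiter.wait()
#     r = requests.get(f"https://api.themoviedb.org/3/movie/{movie_id}",
#                      params={"api_key": API_KEY})
```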

    Inspiration

    Starting from there, I had multiple problem statements for both supervised and unsupervised machine learning:

    • Can I program a tailored recommendation system based on my own criteria?

    • What are the characteristics of the movies/directors I like most?

    • What is the probability that I will like my next movie?

    • Can I find the data?

    One of the objectives of sharing my work here is to find cinephile data scientists who might be interested and, hopefully, contribute or share insights :) Other interesting leads: use the tagline for NLP/clustering/genre guessing, leverage budget/revenue figures, link with other data sources using the normalized IMDb title, etc.

    Figure: correlation matrix (https://img11.hostingpics.net/pics/977004matrice.png)

    Motivation, Disclaimer and Acknowledgements

    • I graduated from a French engineering school, majoring in artificial intelligence, but that was 17 years ago, right in the middle of the AI winter. Like a lot of white male rocket scientists, I ended up in one of the leading European investment banks, quickly abandoning IT development to specialize in trading/risk project management and internal politics. My recent appointment in the Data Office made me aware of recent breakthroughs in data science, and I thought that developing a side project would be an excellent occasion to learn something new. Plus, it would give me the well-needed credibility that decision makers too often lack when it comes to data science.

    • I've worked on some of the features with Cédric Paternotte, a friend of mine who is a professor of philosophy of science at the Sorbonne. Working with someone from a different background seemed a good idea for motivation, creativity, and rigor.

    • Kudos to www.themoviedb.org and www.wikipedia.com, which really have a great attitude towards open data. This is typically NOT the case for modern big-data companies, which mostly keep data to themselves to try to monetize it. Such a huge contrast with the IMDb or Instagram APIs, which generously let you grab your last 3 comments at a miserable rate. Even if 15 years ago this seemed a mandatory path to getting services for free, I predict that one day governments will need to break this data monopoly.

    [Disclaimer: I apologize in advance for my English (I'm French ^-^), for any bad code I've written (there are probably hundreds of ways to do it better and faster), for any pseudo-scientific assumption I've made (I'm slowly getting back into statistics and lack senior guidance; one day I regress a non-stationary time series and the day after I discover I shouldn't have), and for any incorrect use of machine-learning models]

    Figure: powered by themoviedb.org (https://img11.hostingpics.net/pics/898068408x161poweredbyrectanglegreen.png)

  17. Bollywood VS Hollywood runtime comparison

    • figshare.com
    txt
    Updated Jan 26, 2019
    Cite
    Yoritam Nekrabooty (2019). Bollywood VS Hollywood runtime comparison [Dataset]. http://doi.org/10.6084/m9.figshare.7635752.v3
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 26, 2019
    Dataset provided by
    figshare
    Authors
    Yoritam Nekrabooty
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Area covered
    Hollywood
    Description

    Dataset and visualization script comparing the runtimes of Bollywood and Hollywood (American) feature-length films from 1970 to 2018, based on data publicly available on DBpedia/Wikipedia. The data were automatically extracted and then manually cleaned to ease analysis. Sources: Bollywood: https://en.wikipedia.org/wiki/Category:Lists_of_Bollywood_films_by_year; Hollywood: https://en.wikipedia.org/wiki/Category:Lists_of_American_films_by_year. Work/runtime data were obtained from http://dbpedia.org/ when available; otherwise, if the "Running time" field was present in the Infobox of the corresponding http://en.wikipedia.org/wiki page, the data in that field were used. The data were further manually screened to remove entries for non-feature-length films such as short films and TV series; however, this screening was not exhaustive.

  18. Biggest Netflix libraries in the world 2024

    • statista.com
    • ai-chatbox.pro
    Updated Oct 21, 2024
    Cite
    Statista (2024). Biggest Netflix libraries in the world 2024 [Dataset]. https://www.statista.com/statistics/1013571/netflix-library-size-worldwide/
    Explore at:
    Dataset updated
    Oct 21, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jul 2024
    Area covered
    World
    Description

    Industry data revealed that Slovakia had the most extensive Netflix media library worldwide as of July 2024, with over 8,500 titles available on the platform. Interestingly, the top 10 ranking was spearheaded by European countries.

    Where do you get the most bang for your Netflix buck? In February 2024, Liechtenstein and Switzerland had the most expensive Netflix subscription rates: viewers paid around 21.19 U.S. dollars per month for a standard subscription, with between roughly 6,500 and 6,900 titles to choose from. On the other end of the spectrum, Pakistan, Egypt, and Nigeria are among the countries with the cheapest Netflix subscriptions, at around 2.90 to 4.65 U.S. dollars per month.

    Popular content on Netflix: While viewing preferences differ across countries and regions, some titles have proven particularly popular with international audiences. As of mid-2024, "Red Notice" and "Don't Look Up" were the most popular English-language movies on Netflix, each with over 230 million views in their first 91 days on the platform. Meanwhile, "Troll" ranks first among the top non-English-language Netflix movies of all time; the monster film has amassed 103 million views, making it the most successful Norwegian-language film on the platform to date.

  19. PeakAffectDS

    • zenodo.org
    zip
    Updated Apr 24, 2025
    + more versions
    Cite
    Nick Greene; Steven R. Livingstone; Steven R. Livingstone; Lech Szymanski; Lech Szymanski; Nick Greene (2025). PeakAffectDS [Dataset]. http://doi.org/10.5281/zenodo.6403363
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nick Greene; Steven R. Livingstone; Steven R. Livingstone; Lech Szymanski; Lech Szymanski; Nick Greene
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Contact Information

    If you would like further information about PeakAffectDS, to purchase a commercial license, or if you experience any issues downloading files, please contact us at peakaffectds@gmail.com.

    Description

    PeakAffectDS contains 663 files (total size: 1.84 GB), consisting of 612 physiology files and 51 perceptual rating files. The dataset covers 51 untrained research participants (39 female, 12 male), whose body physiology was recorded while they watched movie clips validated to induce strong emotional reactions. Emotional conditions included calm, happy, sad, angry, fearful, and disgust, along with a neutral baseline condition. Four physiology channels were recorded with a Biopac MP36 system: two facial muscles with fEMG (zygomaticus major, corrugator supercilii) using Ag/AgCl electrodes, heart activity with ECG using a single-lead (Lead II) configuration, and respiration with a wearable strain-gauge belt. While viewing movie clips, participants indicated in real time when they experienced a "peak" emotional event: chills, tears, or the startle reflex. After each clip, participants rated their felt emotional state using a forced-choice categorical response measure, along with their felt Arousal and Valence. All data are provided in plaintext (.csv) format.

    PeakAffectDS was created in the Affective Data Science Lab.

    Physiology files

    Each participant has 12 .CSV physiology files, consisting of 6 Emotional conditions, and 6 Neutral baseline conditions. All physiology channels were recorded at 2000 Hz. A 50Hz notch filter was then applied to fEMG and ECG channels to remove mains hum. Each .CSV file contains 6 columns, in order from left to right:

    1. Sample timestamp (units: seconds)
    2. EMG Zygomaticus (units: millivolts)
    3. EMG Corrugator (units: millivolts)
    4. ECG (units: millivolts)
    5. Respiration (strain-gauge belt)
    6. Peak event markers: 0 = no event, 1 = chills, 2 = tears, 3 = startle
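    A loading sketch (NumPy only; it assumes comma-separated numeric rows with the peak-event marker in the last column, per the layout above — function and variable names are illustrative):

```python
import numpy as np

def load_physiology(path):
    """Load a physiology .csv; return the full (samples x columns) array
    and the subset of rows flagged with a peak event (last column nonzero)."""
    data = np.loadtxt(path, delimiter=",")
    peaks = data[data[:, -1] > 0]
    return data, peaks
```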

    Perceptual files

    There are 51 perceptual ratings files, one for each participant. Each .CSV file contains 4 columns, in order from left to right:

    1. Filename of presented stimulus (see File naming Convention, below)
    2. Felt emotional response: 1 = neutral, 2 = calm, 3 = happy, 4 = sad, 5 = angry, 6 = fearful, 7 = disgust
    3. Felt Valence, ranging from: 1 = Very negative, to 7 = Very positive
    4. Felt Arousal, ranging from: 1 = Very low, to 7 = Very high

    File naming convention

    Each of the 612 physiology files has a unique filename. The filename consists of a 3-part numerical identifier (e.g., 09-02-03.csv). The first identifier refers to the participant's ID (09), while the remaining two identifiers refer to the stimulus presented for that recording (02-03.mp4); these identifiers define the stimulus characteristics:

    • Participant: 01 = participant 1, 02 = participant 2, ..., 51 = participant 51.
    • Emotion: 01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust.
    • Stimulus set. For Emotional files: 01 = group 1, 02 = group 2, 03 = group 3. For Neutral files: 01 = instance 1, 02 = instance 2, ..., 06 = instance 6.

    Filename example: 09-02-03.csv

    • Participant 9 (09)
    • Calm (02)
    • Stimulus Set 3 (03)

    Filename example: 09-01-05.csv

    • Participant 9 (09)
    • Neutral (01)
    • Instance 5 (05)
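    The naming convention above can be decoded in a few lines (a sketch; the returned field names are illustrative):

```python
EMOTIONS = {1: "neutral", 2: "calm", 3: "happy", 4: "sad",
            5: "angry", 6: "fearful", 7: "disgust"}

def parse_filename(name):
    """Decode a physiology filename such as '09-02-03.csv'."""
    pid, emo, stim = (int(part) for part in name.removesuffix(".csv").split("-"))
    # Neutral files number instances; emotional files number stimulus sets
    key = "instance" if emo == 1 else "stimulus_set"
    return {"participant": pid, "emotion": EMOTIONS[emo], key: stim}

print(parse_filename("09-02-03.csv"))
# {'participant': 9, 'emotion': 'calm', 'stimulus_set': 3}
```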

    Methods

    A 1-way mixed design was used, with a within-subjects factor Emotion (6 levels: Calm, Happy, Sad, Angry, Fearful, Disgust) and a between-subjects factor Stimulus Set (3 levels). Trials were blocked by Affect Condition (Baseline, Emotional), with each participant presented with 6 blocked trials: Baseline (neutral), then Emotional (Calm, ..., Disgust). This design reduced potential contamination from preceding emotional trials by ensuring that participants' physiology began close to a resting baseline for emotional conditions.

    Emotion was presented in pseudorandom order using a carryover-balanced generalised Youden design, generated by the crossdes package in R. Eighteen emotional movie clips were used as stimuli, with three instances for each emotion category (6x3). Clips were grouped into one of three Stimulus Sets, with participants assigned to a given Set using block randomisation. For example, participants assigned to Stimulus Set 1 (PID: 1, 4, 7, ...) all saw the same movie clips, but these clips differed from those in Sets 2 and 3. Six Neutral baseline movie clips were used as stimuli, with all participants viewing the same neutral clips; their order was also generated with a Youden design.

    Stimulus duration varied, with clips lasting several minutes. Lengthy clips without repetition were used to help ensure that participants became engaged and experienced genuine, strong emotional responses. Participants were instructed to indicate immediately, using the keyboard, when they experienced a "peak" emotional event: chills, tears, or startle. Participants were permitted to indicate multiple events in a single trial, and identified the type of the events at the trial feedback stage, along with ratings of emotion category, arousal, and valence. The concept of peak physiological events was explained at the beginning of the experiment, but the three states were not described as being associated with any particular emotion or valence.

    License information

    PeakAffectDS is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, CC BY-NC-SA 4.0.

    Citing PeakAffectDS

    Greene, N., Livingstone, S. R., & Szymanski, L. (2022). PeakAffectDS [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6403363

  20. Dataset for: Simultaneous Chemical and Refractive Index Sensing in the 1-2.5...

    • data.griidc.org
    • search.dataone.org
    Updated Mar 8, 2017
    Cite
    Wei-Chuan Shih (2017). Dataset for: Simultaneous Chemical and Refractive Index Sensing in the 1-2.5 micron Near-Infrared Wavelength Range on Nanoporous Gold Disks [Dataset]. http://doi.org/10.7266/N7FF3QRM
    Explore at:
    Dataset updated
    Mar 8, 2017
    Dataset provided by
    GRIIDC
    Authors
    Wei-Chuan Shih
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    We developed a new method for simultaneous chemical and refractive index sensing in the 1-2.5 micron near-infrared wavelength range on nanoporous gold (NPG) disks. We fabricated NPG disks in the laboratory by sputtering a gold-silver alloy film onto a glass substrate at approximately 80 nm thickness. Polystyrene beads were deposited onto the alloy film in a single layer and were reduced in size using an oxygen plasma treatment. The bead pattern was transferred onto the alloy using a sputter-etch method in argon plasma. After etching, the alloy was sonicated in chloroform to remove residual beads. The disks were then dealloyed using nitric acid. We measured infrared absorption using dispersive scanning UV-Vis-NIR and FT-IR interferometric spectrometers. This dataset reports the extinction spectra of water on NPG disks with diameters of either 350 or 600 nm. For NPG disks of 350 nm diameter, extinction spectra are also provided for 6 other solvents with different refractive indices: salt water, ethanol, hexane, iso-octane, hexadecane, and toluene. We examined the surface-enhanced near-infrared absorption of 350 nm and 600 nm diameter NPG disks with a self-assembled monolayer (SAM) of octadecanethiol (ODT) and report the extinction spectra in this dataset. The surface-enhanced near-infrared absorption (SENIRA) spectra of hexadecane, dodecane, siloxane, pyrene, and Louisiana sweet grade crude oil on either 350 nm or 600 nm NPG disks are also reported. Lastly, we deposited films of poly(methyl methacrylate) (PMMA) of varying thickness (50-150 nm) onto the NPG disk substrate; the surface-enhanced near-infrared absorption spectra of 350 nm and 600 nm NPG disks under these films are reported, as well as the wavelength shift at 1398 nm. This dataset is associated with the paper: Shih, W.-C., Santos, G. M., Zhao, F., Zenasni, O., & Arnob, M. M. P. (2016). Simultaneous Chemical and Refractive Index Sensing in the 1-2.5 μm Near-Infrared Wavelength Range on Nanoporous Gold Disks. Nano Lett., 16(7), 4641–4647, doi:10.1021/acs.nanolett.6b01959.


IMDb Movie Genre Classification Dataset

Description

d. Evaluation Metrics Since this is a multi-label task, consider evaluation metrics such as:

F1 Score (Macro / Micro): Balances precision and recall.
Hamming Loss: Measures the fraction of incorrectly predicted labels.
Subset Accuracy: Stricter evaluation; all labels for a sample must match exactly.

4. Additional Considerations

Baseline Code & Notebooks: Provide a starter notebook with initial data loading, preprocessing, and a simple baseline model. This helps lower the entry barrier for participants who may be new to multi-label NLP tasks.
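The three metrics above can be computed directly on binary indicator matrices; a minimal NumPy sketch (illustrative, not a reference implementation):

```python
import numpy as np

def multilabel_metrics(y_true, y_pred):
    """Micro-F1, Hamming loss, and subset accuracy for
    (n_samples, n_labels) binary indicator matrices."""
    tp = np.sum(y_true * y_pred)
    fp = np.sum((1 - y_true) * y_pred)
    fn = np.sum(y_true * (1 - y_pred))
    micro_f1 = 2 * tp / (2 * tp + fp + fn)
    hamming = np.mean(y_true != y_pred)                      # fraction of wrong labels
    subset_acc = np.mean(np.all(y_true == y_pred, axis=1))   # exact-match rate
    return float(micro_f1), float(hamming), float(subset_acc)

y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0]])
print(multilabel_metrics(y_true, y_pred))  # (0.8, 0.1666..., 0.5)
```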

Evaluation Server & Leaderboard: Ensure that your Kaggle competition setup allows for automatic evaluation using the selected metrics and that a public leaderboard is available for continuous feedback.

Documentation & Discussion: Include detailed documentation describing the datasets, the task requirements, and the evaluation procedure. Additionally, host a discussion forum to foster collaboration among participants.

  5. Final Remarks This challenge not only tests participants' ability to handle multi-label classification and text processing, but also encourages them to explore advanced NLP techniques and model-evaluation strategies. The combination of movie overviews and genre mapping offers a rich dataset for an engaging Kaggle competition.

Original Data Source: IMDb Movie Genre Classification Dataset
