Columns:
- title: The movie title
- overview: A brief description or synopsis of the movie
- genre_ids: One or more genre identifiers (which may be multi-label)

movies_genres.csv:
Columns:
- id: Genre identifier
- name: The corresponding genre name

This setup allows you to map each movie's genre_ids from the movies_overview file to its actual genre names using the movies_genres mapping.
Proposed NLP Task: Multi-Label Genre Classification
Objective: Create an NLP model that, given a movie's overview, predicts the correct genre(s) for that movie. Since a movie may belong to multiple genres, this is a multi-label classification task.
Challenge Breakdown
a. Problem Statement
Participants are tasked with designing and training an NLP model that takes the movie overview text as input and outputs one or more genres. The challenge could encourage approaches spanning from classical text classification methods (e.g., TF-IDF with logistic regression) to modern transformer-based models (e.g., BERT, RoBERTa).
b. Data Preprocessing
- Text Cleaning & Tokenization: Clean the movie overviews (e.g., lowercasing, removing special characters) and tokenize the text.
- Label Preparation: Transform the genre_ids into a multi-label format. Use movies_genres.csv to convert these IDs into genre names.
- Data Splitting: Create training, validation, and test sets, ensuring the distribution of genres is well represented.
c. Baseline Models
Encourage participants to start with simple models (e.g., bag-of-words, TF-IDF combined with logistic regression or random forest) and progress towards deep learning approaches such as LSTM-based networks or transformer models.
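The label-preparation step can be sketched with pandas and scikit-learn. The inline toy rows and the assumption that genre_ids is stored as a string like "[28, 878]" are illustrative; adjust to the actual CSV headers.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-ins for movies_overview and movies_genres (column names assumed).
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "overview": ["a space adventure", "a quiet romance"],
    "genre_ids": ["[28, 878]", "[18]"],  # often stored as strings in CSV dumps
})
genres = pd.DataFrame({"id": [18, 28, 878],
                       "name": ["Drama", "Action", "Science Fiction"]})

# Map genre IDs to names via the movies_genres table.
id2name = dict(zip(genres["id"], genres["name"]))
movies["genre_names"] = movies["genre_ids"].apply(
    lambda s: [id2name[i] for i in ast.literal_eval(s)])

# Binarize into a multi-label indicator matrix for training.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(movies["genre_names"])
print(mlb.classes_)  # alphabetical label order
print(Y)
```

The indicator matrix `Y` is the target format expected by most scikit-learn multi-label estimators.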
d. Evaluation Metrics Since this is a multi-label task, consider evaluation metrics such as:
- F1 Score (Macro / Micro): Balances precision and recall.
- Hamming Loss: Measures how many labels are incorrectly predicted.
- Subset Accuracy: For stricter evaluation (all labels must match exactly).

4. Additional Considerations
Baseline Code & Notebooks: Provide a starter notebook with initial data loading, preprocessing, and a simple baseline model. This helps lower the entry barrier for participants who may be new to multi-label NLP tasks.
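The multi-label metrics listed above are all available in scikit-learn; here is a minimal sketch on toy indicator arrays (not competition data). Note that `accuracy_score` on multi-label indicators is exactly subset accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Toy ground truth and predictions for 4 movies over 3 genres
# (columns: Action, Comedy, Drama) -- illustrative values only.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1]])

print("micro F1:", f1_score(y_true, y_pred, average="micro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
print("Hamming loss:", hamming_loss(y_true, y_pred))
# Subset accuracy: a sample counts only if every label matches.
print("subset accuracy:", accuracy_score(y_true, y_pred))
```

Micro-averaging pools all label decisions before computing F1, while macro-averaging computes F1 per genre and averages, which weights rare genres equally with common ones.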
Evaluation Server & Leaderboard: Ensure that your Kaggle competition setup allows for automatic evaluation using the selected metrics and that a public leaderboard is available for continuous feedback.
Documentation & Discussion: Include detailed documentation describing the datasets, the task requirements, and the evaluation procedure. Additionally, host a discussion forum to foster collaboration among participants.
Original Data Source: IMDb Movie Genre Classification Dataset
🎬 Movie Ratings & Metadata Dataset This dataset provides a structured collection of movie ratings along with essential metadata, making it a valuable resource for data analysis and machine learning projects. It consists of two main tables:
📊 Ratings Data This table contains user-generated ratings for various movies.
- userId → Unique identifier for each user.
- movieId → Unique identifier for each movie (linked to the movie metadata table).
- rating → Rating given by the user (typically on a scale of 0-5).
- timestamp → Time when the rating was given (Unix format).

🎞️ Movie Metadata
This table provides details about the movies being rated.
- movieId → Unique identifier for each movie (linked to the ratings table).
- title → Movie title, including the release year.
- genres → List of genres associated with the movie (e.g., Action, Comedy, Drama).

🎯 Use Cases:
✔️ Recommender Systems – Build personalized movie recommendation models.
✔️ Trend Analysis – Explore how audience preferences change over time.
✔️ Sentiment & Popularity Analysis – Compare ratings across different genres.
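A minimal pandas sketch of joining the two tables on movieId (toy inline rows; in practice load the real tables with pd.read_csv):

```python
import pandas as pd

# Toy stand-ins for the two tables described above.
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 3.5, 5.0],
    "timestamp": [964982703, 964981247, 964982224],
})
movies = pd.DataFrame({
    "movieId": [10, 20],
    "title": ["GoldenEye (1995)", "Money Train (1995)"],
    "genres": ["Action|Adventure|Thriller", "Action|Comedy|Crime|Drama"],
})

# Join each rating to its movie metadata, then average ratings per title.
merged = ratings.merge(movies, on="movieId", how="left")
mean_ratings = merged.groupby("title")["rating"].mean()
print(mean_ratings)
```

This joined frame is the usual starting point for the recommender-system and trend-analysis use cases listed above.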
This dataset is clean, structured, and ready for exploration. 🚀
👉 Start analyzing and uncover interesting movie insights! 🎥
https://choosealicense.com/licenses/other/
Dataset Card for "imdb"
Dataset Summary
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
https://creativecommons.org/publicdomain/zero/1.0/
Thank you for viewing my dataset; looking forward to seeing some code.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The real dataset consists of movie evaluations from IMDB, which provides a platform where individuals can evaluate movies on a scale of 1 to 10. If a user rates a movie and clicks the share button, a Twitter message is generated. We then extract the rating from the Twitter message. We treat the ratings on the IMDB website as the event truths, which are based on the aggregated evaluations from all users, whereas our observations come from only a subset of users who share their ratings on Twitter. Using the Twitter API, we collect information about the follower and following relationships between individuals that generate movie evaluation Twitter messages. To better show the influence of social network information on event truth discovery, we delete small subnetworks that consist of fewer than 5 agents. The final dataset we use consists of 2266 evaluations from 209 individuals on 245 movies (events), along with the social network between these 209 individuals. We regard the social network as undirected, since both follower and following relationships indicate that the two users have similar taste.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We introduce LifeQA, a benchmark dataset for video question answering focusing on daily real-life situations. Current video question-answering datasets consist of movies and TV shows. However, it is well-known that these visual domains do not represent our day-to-day lives. Movies and TV shows, for example, benefit from professional camera movements, clean editing, crisp audio recordings, and scripted dialog between professional actors. While these domains provide a large amount of data for training models, their properties make them unsuitable for testing real-life question-answering systems. Our dataset, by contrast, consists of video clips that represent only real-life scenarios. We collect 275 such video clips and over 2.3k multiple-choice questions. In this paper, we analyze the challenging but realistic aspects of LifeQA and apply several state-of-the-art video question-answering models to provide benchmarks for future research.
For more information, refer to https://lit.eecs.umich.edu/lifeqa/.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides information on the top 6000 (actually 6118) most popular Italian movies available on TMDB as of May 2023. The data was collected by a Python script through the TMDB API; entries were then filtered to remove any film with missing "overview", "release_date", or "vote_average" data. Some details such as "genre", "duration", etc. were omitted, but they can be easily retrieved with the TMDB API using the movie_id parameter from the "id" column.
EmoKey Moments Muse EEG Dataset (EKM-ED): A Comprehensive Collection of Muse S EEG Data and Key Emotional Moments
Dataset Description:
The EmoKey Moments EEG Dataset (EKM-ED) is an intricately curated dataset amassed from 47 participants, detailing EEG responses as they engage with emotion-eliciting video clips. Covering a spectrum of emotions, this dataset holds immense value for those diving deep into human cognitive responses, psychological research, and emotion-based analyses.
Dataset Highlights:
Precise Timestamps: Capturing the exact millisecond of EEG data acquisition, ensuring unparalleled granularity.
Brainwave Metrics: Illuminating the variety of cognitive states through the prism of Delta, Theta, Alpha, Beta, and Gamma waves.
Motion Data: Encompassing the device's movement in three dimensions for enhanced contextuality.
Auxiliary Indicators: Key elements like the device's positioning, battery metrics, and user-specific actions are meticulously logged.
Consent and Ethics: The dataset respects and upholds privacy and ethical standards. Every participant provided informed consent. This endeavor has received the green light from the Ethics Committee at the University of Granada, documented under the reference: 2100/CEIH/2021.
A pivotal component of this dataset is its focus on "key moments" within the selected video clips, honing in on periods anticipated to evoke heightened emotional responses.
Curated Video Clips within Dataset:
Film               | Emotion   | Duration (seconds)
The Lover          | Baseline  | 43
American History X | Anger     | 106
Cry Freedom        | Sadness   | 166
Alive              | Happiness | 310
Scream             | Fear      | 395
The cornerstone of EKM-ED is its innovative emphasis on these key moments, bringing to light the correlation between distinct cinematic events and specific EEG responses.
Key Emotional Moments in Dataset:
Film               | Emotion   | Key moment timestamps (seconds)
American History X | Anger     | 36, 57, 68
Cry Freedom        | Sadness   | 112, 132, 154
Alive              | Happiness | 227, 270, 289
Scream             | Fear      | 23, 42, 79, 226, 279, 299, 334
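As an illustration of how these key-moment timestamps might be used, here is a hedged sketch that slices fixed windows around the Anger timestamps from a synthetic multi-channel recording. The 256 Hz rate matches the Muse recording rate noted in the folder description, but the window lengths, channel count, and array layout are assumptions.

```python
import numpy as np

fs = 256                                  # assumed Muse S recording rate (Hz)
# Synthetic stand-in: 4 EEG channels, 120 s of data.
eeg = np.random.default_rng(0).standard_normal((4, fs * 120))

def epoch_around(eeg, fs, t_sec, pre=1.0, post=3.0):
    """Slice a [t - pre, t + post) window around a key moment at t_sec."""
    start = int((t_sec - pre) * fs)
    stop = int((t_sec + post) * fs)
    return eeg[:, start:stop]

# Key moments for Anger (American History X), from the table above.
epochs = [epoch_around(eeg, fs, t) for t in (36, 57, 68)]
print([e.shape for e in epochs])  # [(4, 1024), (4, 1024), (4, 1024)]
```

Epoching around annotated events like this is the usual first step before averaging or feature extraction in EEG emotion analyses.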
Citation: Gilman, T. L., et al. (2017). A film set for the elicitation of emotion in research. Behavior Research Methods, 49(6). Link to the study
With its unparalleled depth and focus, the EmoKey Moments EEG Dataset aims to advance research in fields such as neuroscience, psychology, and affective computing, providing a comprehensive platform for understanding and analyzing human emotions through EEG data.
——————————————————————————————————— FOLDER STRUCTURE DESCRIPTION ———————————————————————————————————
questionnaires: all the response questionnaires (Spanish), raw and preprocessed, including SAM
|——preprocessed: Ficha_Evaluacion_Participante_SAM_Refactored.csv: the SAM responses for every film clip
key_moments: the key moment timestamps for every emotion’s clip
muse_wearable_data: XXXX
|
|—raw
|——1: ID = 1 of subject
|————muse: EEG data of Muse device
|—————————ANGER_XXX.csv: EEG data of the anger elicitation
|—————————FEAR_XXX.csv: EEG data of the fear elicitation
|—————————HAPPINESS_XXX.csv: EEG data of the happiness elicitation
|—————————SADNESS_XXX.csv: EEG data of the sadness elicitation
|————order: film elicitation order of play. For example: HAPPINESS,SADNESS,ANGER,FEAR …
|
|—preprocessed
|——unclean-signals: without removing EEG artifacts, noise, etc.
|————muse: EEG data of Muse device
|—————————0.0078125: data downsampled to 128 Hz from the 256 Hz recorded
|——clean-signals: EEG artifacts, noise, etc. removed
|————muse: EEG data of Muse device
|—————————0.0078125: data downsampled to 128 Hz from the 256 Hz recorded
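The 256 Hz → 128 Hz downsampling mentioned in the folder names (0.0078125 s is the 1/128 Hz sample period) can be sketched with SciPy. This is an illustrative stand-in on a synthetic signal, not the preprocessing actually used for the dataset:

```python
import numpy as np
from scipy.signal import decimate

fs_in, fs_out = 256, 128           # recorded and target rates (from the notes above)
t = np.arange(0, 2, 1 / fs_in)     # 2 s of synthetic signal
x = np.sin(2 * np.pi * 10 * t)     # 10 Hz component, well below the new Nyquist (64 Hz)

# Anti-alias filter + downsample by the integer factor 256 / 128 = 2.
y = decimate(x, fs_in // fs_out, ftype="fir", zero_phase=True)
print(len(x), len(y))  # 512 256
```

`decimate` applies a low-pass filter before subsampling, which avoids the aliasing that naive slicing (`x[::2]`) can introduce.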
The ethical consent for this dataset was provided by La Comisión de Ética en Investigación de la Universidad de Granada, as documented in the approval titled: 'DETECCIÓN AUTOMÁTICA DE LAS EMOCIONES BÁSICAS Y SU INFLUENCIA EN LA TOMA DE DECISIONES MEDIANTE WEARABLES Y MACHINE LEARNING' registered under 2100/CEIH/2021.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Contributors: Liberty S. Hamilton, PhD, Maansi Desai, PhD, Alyssa Field, MEd
Email: liberty.hamilton@austin.utexas.edu
This is a sample BIDS dataset for the WIRED ICM course in Paris, France in March 2024.
This contains intracranial recordings collected by the Hamilton Lab at the University of Texas at Austin. These recordings include examples of evoked data during natural listening tasks along with some examples of seizure-related activity and vagus nerve stimulator (VNS) artifact for illustrative purposes. All procedures were approved by the University of Texas at Austin Institutional Review Board.
Funding: Support was provided by the National Institutes of Health National Institute on Deafness and Other Communication Disorders (R01 DC018579, to LSH).
movietrailers
- this task involves patients listening to movie clips from various Pixar, Disney, Dreamworks, and other movies. We have published previously using these stimuli in EEG (Desai et al. 2021).

timit4 and timit5
- these tasks involve patients listening to subsets of the TIMIT acoustic phonetic corpus (Garofolo et al. 1993). The events provided in the dataset mark the onset and offset of each sentence. In timit4, each sentence is unique, while in timit5, 10 sentences are repeated 10 times. This is the same stimulus set used in Mesgarani et al. 2014, Hamilton et al. 2018, Hamilton et al. 2021, and Desai et al. 2021.

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
*****ForrestGump-MEG: An audio-visual movie-watching MEG dataset*****
For details please refer to our paper on [].
This dataset contains MEG data recorded from 11 subjects while they watched the 2-hour Chinese-dubbed audio-visual movie 'Forrest Gump'. The data were acquired with a 275-channel CTF MEG system. Auxiliary data (T1w) as well as derivatives such as preprocessed data and the MEG-MRI co-registration are also included.
Pre-process procedure description
The T1w images stored as NIFTI files were minimally-preprocessed using the anatomical preprocessing pipeline from fMRIPrep with default settings.
MEG data were pre-processed using MNE following a three-step procedure: (1) bad channels were detected and removed; (2) a 1 Hz high-pass filter was applied to remove slow drifts from the continuous MEG data; (3) artifact removal was performed with ICA.
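The high-pass step can be illustrated with a self-contained SciPy sketch (the actual pipeline uses MNE's filtering; the sampling rate, filter order, and signal below are assumptions made purely for illustration):

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 600.0                         # assumed sampling rate for this sketch
t = np.arange(0, 10, 1 / fs)
drift = 0.5 * t                    # synthetic slow drift
signal = np.sin(2 * np.pi * 12 * t)
x = signal + drift

# Step 2 of the pipeline: 1 Hz high-pass to remove slow drifts.
b, a = butter(4, 1.0, btype="highpass", fs=fs)
y = filtfilt(b, a, x)              # zero-phase filtering
print(abs(y.mean()))               # near zero: the drift is removed
```

Zero-phase filtering (`filtfilt`) avoids shifting event timing, which matters when MEG responses are later aligned to movie events.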
Stimulus material
The audio-visual stimulus material was from the Chinese-dubbed 'Forrest Gump' DVD released in 2013 (ISBN: 978-7-7991-3934-0), which cannot be publicly released here due to copyright restrictions.
Dataset content overview
The data were organized following MEG-BIDS using the MNE-BIDS toolbox.
the pre-processed MEG data
The preprocessed MEG recordings including the preprocessed MEG data, the event files, the ICA decomposition and label files and the MEG-MRI coordinate transformation file are hosted here.
└── ./derivatives/preproc_meg-mne_mri-fmriprep/sub-xx/ses-movie/meg/
    ├── sub-xx_ses-movie_coordsystem.json
    ├── sub-xx_ses-movie_task-movie_run-xx_channels.tsv
    ├── sub-xx_ses-movie_task-movie_run-xx_decomposition.tsv
    ├── sub-xx_ses-movie_task-movie_run-xx_events.tsv
    ├── sub-xx_ses-movie_task-movie_run-xx_ica.fif.gz
    ├── sub-xx_ses-movie_task-movie_run-xx_meg.fif
    ├── sub-xx_ses-movie_task-movie_run-xx_meg.json
    ├── ...
    └── sub-xx_ses-movie_task-movie_trans.fif
the pre-processed MRI data
The preprocessed MRI volume, reconstructed surfaces, and other associated files, including transformation files, are hosted here.
└── ./derivatives/preproc_meg-mne_mri-fmriprep/sub-xx/ses-movie/anat/
    ├── sub-xx_ses-movie_desc-preproc_T1w.nii.gz
    ├── sub-xx_ses-movie_hemi-L_inflated.surf.gii
    ├── sub-xx_ses-movie_hemi-L_midthickness.surf.gii
    ├── sub-xx_ses-movie_hemi-L_pial.surf.gii
    ├── sub-xx_ses-movie_hemi-L_smoothwm.surf.gii
    ├── sub-xx_ses-movie_hemi-R_inflated.surf.gii
    ├── sub-xx_ses-movie_hemi-R_midthickness.surf.gii
    ├── sub-xx_ses-movie_hemi-R_pial.surf.gii
    ├── sub-xx_ses-movie_hemi-R_smoothwm.surf.gii
    ├── sub-xx_ses-movie_space-MNI152NLin2009cAsym_desc-preproc_T1w.nii.gz
    ├── sub-xx_ses-movie_space-MNI152NLin6Asym_desc-preproc_T1w.nii.gz
    └── ...
the FreeSurfer surface data, the high-resolution head surface and the MRI-fiducials are provided here
└── ./derivatives/preproc_meg-mne_mri-fmriprep/sourcedata/
    └── freesurfer
        └── sub-xx
            └── ...
the raw data
└── ./sub-xx/ses-movie/
    ├── meg/
    │   ├── sub-xx_ses-movie_coordsystem.json
    │   ├── sub-xx_ses-movie_task-movie_run-xx_channels.tsv
    │   ├── sub-xx_ses-movie_task-movie_run-xx_events.tsv
    │   ├── sub-xx_ses-movie_task-movie_run-xx_meg.ds
    │   ├── sub-xx_ses-movie_task-movie_run-xx_meg.json
    │   └── ...
    └── anat/
        ├── sub-xx_ses-movie_T1w.json
        └── sub-xx_ses-movie_T1w.nii.gz
https://choosealicense.com/licenses/unknown/
Dataset Card for [Dataset Name]
Dataset Summary
The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/sst2.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Four stimuli were used:
(Note that the versions presented to subjects were edited to remove credits and title pages; these edited versions are available upon request.)
Stimuli Description
- Iteration (https://youtu.be/c53fGdK84rc; 12:27 min:s) is a sci-fi movie that follows a female character as she goes through multiple iterations of waking up and trying to escape a facility. A male character appears toward the end to help her.
- Defeat (https://youtu.be/6yN9VH_4GSQ; 7:57 min:s) follows a family of three (mother, two children) as the brother bullies his sister and she builds a time machine to go back and get revenge.
- Growth (https://youtu.be/JyvFXBA3O8o; 8:27 min:s) follows a family of four (mother, father, two brothers) as the children grow up and eventually move out amid some family conflict.
- Lemonade (https://youtu.be/Av07QiqmsoA; 7:27 min:s) is a Rube Goldberg machine consisting of a series of objects that move throughout a house and ends in the pouring of a cup of lemonade. This movie was lightly edited to remove fleeting shots of human characters.

Iteration and Defeat both contained screen cuts (continuity editing), whereas Growth and Lemonade were shot in a continuous fashion with the camera panning smoothly from one scene to the next.
Runs are a bit longer than the movie stimuli themselves. We dropped the first 2 TRs and the last 12 TRs for each functional run. Please reach out if you have any questions.
For Growth, this corresponds to TRs: 2:505
For Lemonade, this corresponds to TRs: 2:449
For Defeat, this corresponds to TRs: 2:480
For Iteration, this corresponds to TRs: 2:748
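Dropping the first 2 and last 12 TRs amounts to a single slice along the time axis of each functional run; the run length and array layout below are hypothetical:

```python
import numpy as np

# Toy 4D array standing in for a functional run: (x, y, z, time).
n_trs = 519                        # hypothetical raw run length
run = np.zeros((4, 4, 4, n_trs))

# Drop the first 2 and the last 12 TRs, as described above.
trimmed = run[..., 2:-12]
print(trimmed.shape[-1])  # 505
```

The negative stop index makes the trim independent of the exact run length, which differs across the four movies.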
MPI (Max Planck Institute) Sintel is a dataset for optical flow evaluation that has 1064 synthesized stereo images and ground-truth disparity data. It is derived from the open-source 3D animated short film Sintel. The dataset has 23 different scenes. The stereo images are RGB while the disparity is grayscale. Both have a resolution of 1024×436 pixels and 8 bits per channel.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Template-stripped substrates provide on-demand access to clean, ultraflat gold surfaces, avoiding the need for laborious cleaning procedures or the use of expensive single-crystal electrodes. While these gold/adhesion layer/support sandwich structures are most conveniently prepared through the application of epoxy or optical adhesives, such composites exhibit instabilities in organic solvents that limit their wider application. Here we demonstrate that substrates with solvent-impermeable metal films can be used in previously problematic chemical environments after integration into a protective, custom-built (electrochemical) flow cell. We apply our methodology to probe different self-assembled monolayers, observing reproducible alkanethiol reductive desorption features, an exemplary redox response using 6-(ferrocenyl)hexanethiol, and corroborate findings that cobalt(II) bis(terpyridine) assemblies exhibit a low coverage. This work significantly extends the utility of these substrates, relative to mechanically polished or freshly deposited alternatives, particularly for studies of systems involving adsorbed molecules whose properties are strongly influenced by the nanoscopic features of the metal-solution interface.
OpenSubtitles is a collection of multilingual parallel corpora. The dataset is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages.
I love movies.
I tend to avoid Marvel-Transformers standardized products, and prefer a mix of classic Hollywood golden age and obscure Polish artsy movies. Throw in an occasional Japanese-zombie-slasher-giallo as an alibi. Good movies don't exist without bad movies.
On average I watch 200+ movies each year, with peaks at more than 500 movies. Nine years ago I started to log my movies to avoid watching the same movie twice, and also to assign scores. Over the years, it gave me a couple of insights on my viewing habits, but nothing more than what a tenth-grader would learn at school.
I've recently subscribed to Netflix, and it pains me to see the global inefficiency of recommendation systems for people like me, who mostly swear by "La politique des auteurs". It's a term coined by the famous New Wave French movie critic André Bazin, meaning that the quality of a movie is essentially linked to the director and his capacity to execute his vision with his crew. We could debate whether that depends on the movie production pipeline, but let's not for now. Practically, what it means is that I essentially watch movies from directors who made films I've liked.
I suspect Netflix calibrates their recommendation models taking into account the way the average Joe chooses a movie. A few months ago I read a study based on a survey, showing that people choose a movie mostly based on genre (55%), then by leading actors (45%). Director or release date were far behind, around 10% each. This is not surprising, since most people I know don't care who the director is. Lots of US blockbusters don't even mention it on the movie poster. I am aware that collaborative filtering is based on user proximity, which I believe decreases (or even eliminates) the need to characterize a movie. So here I'm more interested in content-based filtering, which is based on product proximity, for several reasons:
- Users' tastes are not easily accessible. It is, after all, Netflix's treasure chest.
- The movie offer on Netflix is so bad for someone who likes auteur movies that it wouldn't help.
- Modeling a movie's intrinsic qualities is a nice challenge.
Enough.
"*The secret of getting ahead is getting started*" (Mark Twain)
[network graph] https://img11.hostingpics.net/pics/117765networkgraph.png
The primary source is www.themoviedb.org. If you watch obscure artsy Romanian homemade movies, you may find only 95% of your movies referenced... but for anyone else it should be in the 98%+ range.
Movie details are from the www.themoviedb.org API: movies/details
Movie crew & casting are from the www.themoviedb.org API: movies/credits
Both can be joined by id.
They contain all 350k movies, from the end of the 19th century to August 2017. If you remove short movies from imdb, you get a similar number of movies.
I uploaded the program to retrieve incremental movie details on GitHub: https://github.com/stephanerappeneau/scienceofmovies/tree/master/PycharmProjects/GetAllMovies (needs a dev API key from themoviedb.org though).
I have tried various supervised (decision tree) and unsupervised (clustering, NLP) approaches described in the discussions; source code is on GitHub: https://github.com/stephanerappeneau/scienceofmovies
As a bonus, I've uploaded the bio summaries of the top 500 critically acclaimed directors from Wikipedia, for some interesting NLTK analysis.
Here is an overview of the available sources I've tried:
• Imdb.com free csv dumps (ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/temporaryaccess/) are badly documented, incomplete, loosely structured, and impossible to join/merge. There's an API hosted by Amazon Web Services: 1€ every 100,000 requests. With around 1 million movies it could become expensive, and the features are bare. So I searched for other sources.
• www.themoviedb.org is based on crowdsourcing and has an excellent API, limited to 40 requests every 10 seconds. It is quite generous, well documented, and enough to sweep the 450,000 movies in a few days. For my purpose, data quality is not significantly worse than imdb's, and as the imdb key is also included, there's always the possibility to complete my dataset later (I actually did it).
• www.Boxofficemojo.com has some interesting budget/revenue figures (which are sorely lacking in both imdb & tmdb), but it actually tracks only a few thousand movies, mainly blockbusters. There are other professional sources used by the film industry to get better predictive/marketing insights, but that's beyond my reach for this experiment.
• www.wikipedia.com is an interesting source with no real cap on API calls; however, it requires a bit of webscraping, and for movies or directors the layout and quality vary a lot, so I suspected it'd take a lot of work to get insights and put this source at lower priority.
• www.google.com will ban you after a few minutes of web scraping, because their job is to scrape data from others, then sell it, duh.
• It's worth mentioning that there are a few dumps of anonymized Netflix user tastes on Kaggle, because they've organised a few competitions to improve their recommendation models. https://www.kaggle.com/netflix-inc/netflix-prize-data
• Online databases are largely white Anglo-Saxon-centric, meaning the Bollywood offer (India is the 2nd biggest producer of movies) is mostly absent from datasets. I'm fine with that, as it's not my cup of tea, plus I lack domain knowledge. The sheer number of Indian movies would probably skew my results anyway (I don't want too many martial-arts musicals in my recommendations ;-)). I have, however, tremendous respect for the Indian movie industry, so I'd love to collaborate with an Indian cinephile!
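The TMDB limit of 40 requests every 10 seconds mentioned above can be respected with a small sliding-window throttle. This is an illustrative sketch (not an official client; the class name and interface are my own):

```python
from collections import deque

class Throttle:
    """Allow at most `max_requests` in any sliding `window`-second period."""

    def __init__(self, max_requests=40, window=10.0):
        self.max_requests, self.window = max_requests, window
        self.sent = deque()  # timestamps of requests inside the window

    def allow(self, now):
        """Return True if a request at time `now` stays within the limit."""
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()            # forget requests older than the window
        if len(self.sent) < self.max_requests:
            self.sent.append(now)
            return True
        return False

throttle = Throttle()
# 41 requests at the same instant: the 41st must wait.
results = [throttle.allow(0.0) for _ in range(41)]
print(results.count(True))  # 40
```

In a real crawler you would call `allow(time.monotonic())` before each HTTP request and sleep briefly whenever it returns False.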
[Westerns] https://img11.hostingpics.net/pics/340226westerns.png
Starting from there, I had multiple problem statements for both supervised and unsupervised machine learning:
- Can I program a tailored recommendation system based on my own criteria?
- What are the characteristics of the movies/directors I like most?
- What is the probability that I will like my next movie?
- Can I find the data?
One of the objectives of sharing my work here is to find cinephile data scientists who might be interested and, hopefully, contribute or share insights :) Other interesting leads: use the tagline for NLP/clustering/genre guessing, leverage budget/revenue, link with other data sources using the imdb normalized title, etc.
[Correlation matrix] https://img11.hostingpics.net/pics/977004matrice.png
I graduated from a French engineering school, majoring in artificial intelligence, but that was 17 years ago, right in the middle of the A.I. winter. Like a lot of white male rocket scientists, I ended up in one of the leading European investment banks, quickly abandoning IT development to specialize in trading/risk project management and internal politics. My recent appointment to the Data Office made me aware of recent breakthroughs in data science, and I thought that developing a side project would be an excellent occasion to learn something new. Plus, it'd give me some well-needed credibility, which decision makers too often lack when it comes to data science.
I've worked on some of the features with Cédric Paternotte, a friend of mine who is a professor of philosophy of science at La Sorbonne. Working with someone with a different background seemed a good idea for motivation, creativity and rigor.
Kudos to the www.themoviedb.org and www.wikipedia.com sites, which really have a great attitude towards open data. This is typically NOT the case for modern big-data companies, which mostly keep data to themselves to try to monetize it. Such a huge contrast with the imdb or instagram APIs, which generously let you grab your last 3 comments at a miserable rate. Even if 15 years ago this seemed a mandatory path to get services for free, I predict that one day governments will need to break this data monopoly.
[Disclaimer: I apologize in advance for my Engrish (I'm French ^-^), for any bad code I've written (there are probably hundreds of ways to do it better and faster), for any pseudo-scientific assumptions I've made (I'm slowly getting back into statistics and lack senior guidance; one day I regress a non-stationary time series and the day after I discover I shouldn't have), and for any incorrect use of machine-learning models.]
[powered by themoviedb.org] https://img11.hostingpics.net/pics/898068408x161poweredbyrectanglegreen.png
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A dataset and visualization script for comparing trends in the runtimes of Bollywood and Hollywood (American) feature-length films, 1970-2018, based on data publicly available on DBpedia/Wikipedia. The data were automatically extracted from DBpedia/Wikipedia and then manually organized for easier analysis.
Bollywood: https://en.wikipedia.org/wiki/Category:Lists_of_Bollywood_films_by_year
Hollywood: https://en.wikipedia.org/wiki/Category:Lists_of_American_films_by_year
Work/runtime data were obtained from http://dbpedia.org/ when available. Otherwise, if the "Running time" field was present in the Infobox of the http://en.wikipedia.org/wiki page, the data in that field was used. The data were further manually screened to remove any entries for non-feature-length films such as short films and TV series; however, this screening was not exhaustive.
Industry data revealed that Slovakia had the most extensive Netflix media library worldwide as of July 2024, with over 8,500 titles available on the platform. Interestingly, the top 10 ranking was spearheaded by European countries. Where do you get the most bang for your Netflix buck? In February 2024, Liechtenstein and Switzerland were the countries with the most expensive Netflix subscription rates: viewers had to pay around 21.19 U.S. dollars per month for a standard subscription. Subscribers in these countries could choose from between around 6,500 and 6,900 titles. On the other end of the spectrum, Pakistan, Egypt, and Nigeria are some of the countries with the cheapest Netflix subscription costs, at around 2.90 to 4.65 U.S. dollars per month. Popular content on Netflix: while viewing preferences can differ across countries and regions, some titles have proven particularly popular with international audiences. As of mid-2024, "Red Notice" and "Don't Look Up" were the most popular English-language movies on Netflix, each with over 230 million views in their first 91 days on the platform. Meanwhile, "Troll" ranks first among the top non-English-language Netflix movies of all time. The monster film has amassed 103 million views on Netflix, making it the most successful Norwegian-language film on the platform to date.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contact Information
If you would like further information about PeakAffectDS, to purchase a commercial license, or if you experience any issues downloading files, please contact us at peakaffectds@gmail.com.
Description
PeakAffectDS contains 663 files (total size: 1.84 GB), consisting of 612 physiology files and 51 perceptual rating files. The dataset covers 51 untrained research participants (39 female, 12 male), who had their body physiology recorded while watching movie clips validated to induce strong emotional reactions. Emotional conditions included calm, happy, sad, angry, fearful, and disgust, along with a neutral baseline condition. Four physiology channels were recorded with a Biopac MP36 system: two facial muscles with fEMG (zygomaticus major, corrugator supercilii) using Ag/AgCl electrodes, heart activity with ECG using a single-lead (Lead II) configuration, and respiration with a wearable strain-gauge belt. While viewing movie clips, participants indicated in real time when they experienced a "peak" emotional event, including chills, tears, or the startle reflex. After each clip, participants further rated their felt emotional state using a forced-choice categorical response measure, along with their felt Arousal and Valence. All data are provided in plaintext (.csv) format.
PeakAffectDS was created in the Affective Data Science Lab.
Physiology files
Each participant has 12 .CSV physiology files, consisting of 6 Emotional conditions and 6 Neutral baseline conditions. All physiology channels were recorded at 2000 Hz. A 50 Hz notch filter was then applied to fEMG and ECG channels to remove mains hum. Each .CSV file contains 6 columns, in order from left to right:
Perceptual files
There are 51 perceptual ratings files, one for each participant. Each .CSV file contains 4 columns, in order from left to right:
File naming convention
Each of the 612 physiology files has a unique filename. The filename consists of a 3-part numerical identifier (e.g., 09-02-03.csv). The first identifier refers to the participant's ID (09), while the remaining two identifiers refer to the stimulus presented for that recording (02-03.mp4); these identifiers define the stimulus characteristics:
Filename example: 09-02-03.csv
Filename example: 09-01-05.csv
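Based on the naming convention above, a physiology filename can be split programmatically. The sketch below is illustrative only: the helper name and the returned field names are not part of the dataset documentation.

```python
import os

def parse_physiology_filename(filename):
    """Split a filename such as '09-02-03.csv' into the participant ID
    and the two stimulus identifiers (field names are illustrative)."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    participant, stim_a, stim_b = stem.split("-")
    return {
        "participant_id": participant,        # e.g. '09'
        "stimulus_id": f"{stim_a}-{stim_b}",  # matches stimulus file '02-03.mp4'
    }

print(parse_physiology_filename("09-02-03.csv"))
```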
Methods
A 1-way mixed design was used, with a within-subjects factor Emotion (6 levels: Calm, Happy, Sad, Angry, Fearful, Disgust) and a between-subjects factor Stimulus Set (3 levels). Trials were blocked by Affect Condition (Baseline, Emotional), with each participant presented 6 blocked trials: Baseline (neutral), then Emotional (Calm, ..., Disgust). This design reduced potential contamination from preceding emotional trials by ensuring that participants' physiology began close to a resting baseline for emotional conditions.
Emotion was presented in pseudorandom order using a carryover-balanced generalised Youden design, generated by the crossdes package in R. Eighteen emotional movie clips were used as stimuli, with three instances for each emotion category (6x3). Clips were then grouped into one of three Stimulus Sets, with participants assigned to a given Set using block randomisation. For example, participants assigned to Stimulus Set 1 (PID: 1, 4, 7, ...) all saw the same movie clips, but these clips differed from those in Sets 2 and 3. Six Neutral baseline movie clips were used as stimuli, with all participants viewing the same neutral clips, with their order also generated with a Youden design.
Stimulus duration varied, with clips lasting several minutes. Lengthy clips without repetition were used to help ensure that participants became engaged, and experienced genuine, strong emotional responses. Participants were instructed to immediately indicate using the keyboard when experiencing a "peak" emotional event, including chills, tears, or startle. Participants were permitted to indicate multiple events in a single trial, and identified the type of the events at the trial feedback stage, along with ratings of emotional category, arousal, and valence. The concept of peak physiological events was explained at the beginning of the experiment, but the three states were not described as being associated with any particular emotion or valence.
License information
PeakAffectDS is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, CC BY-NC-SA 4.0.
Citing PeakAffectDS
Greene, N., Livingstone, S. R., & Szymanski, L. (2022). PeakAffectDS [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6403363
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We developed a new method to obtain chemical and refractive index sensing between the 1 and 2.5 micron near-infrared wavelengths on nanoporous gold (NPG) disks. We fabricated NPG disks in the laboratory by sputtering a gold-silver alloy film onto a glass substrate at approximately 80 nm thickness. Polystyrene beads were deposited onto the alloy film in a single layer and were reduced in size using an oxygen plasma treatment. The bead pattern was transferred onto the alloy using a sputter-etch method in argon plasma. After etching, the alloy was sonicated in chloroform to remove residual beads. The disks were then dealloyed using nitric acid. We measured infrared absorption using dispersive scanning UV-Vis-NIR and FT-IR interferometric spectrometers. This dataset reports the extinction spectra of water on NPG disks with diameters of either 350 or 600 nm. For NPG disks of 350 nm diameter, the extinction spectra are also provided for six other solvents with different refractive indices: salt water, ethanol, hexane, iso-octane, hexadecane, and toluene. We examined the surface-enhanced near-infrared absorption of 350 nm and 600 nm diameter NPG disks with a self-assembled monolayer (SAM) of octadecanethiol (ODT) and report the extinction spectra in this dataset. The surface-enhanced near-infrared absorption (SENIRA) spectra of hexadecane, dodecane, siloxane, pyrene, and Louisiana sweet grade crude oil on either 350 nm or 600 nm NPG disks are also reported in this dataset. Lastly, we deposited a film of poly(methyl methacrylate) (PMMA) of varying thickness (50-150 nm) onto the NPG disk substrate. The surface-enhanced near-infrared absorption spectra of 350 nm and 600 nm NPG disks on these films are reported in this dataset, as well as the wavelength shift at 1398 nm. This dataset is associated with the following paper: Shih, W.-C., Santos, G. M., Zhao, F., Zenasni, O., & Arnob, M. M. P. (2016). Simultaneous Chemical and Refractive Index Sensing in the 1-2.5 μm Near-Infrared Wavelength Range on Nanoporous Gold Disks. Nano Lett., 16(7), 4641-4647. doi:10.1021/acs.nanolett.6b01959.
Columns:

title: The movie title
overview: A brief description or synopsis of the movie
genre_ids: One or more genre identifiers (which could be multi-label)

movies_genres.csv:
Columns:

id: Genre identifier
name: The corresponding genre name

This setup allows you to map each movie’s genre_ids from the movies_overview file to its actual genre names using the movies_genres mapping.
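A minimal sketch of this ID-to-name mapping with pandas. The inline frames stand in for the two CSV files, and the stringified-list format of genre_ids (e.g., "[28, 12]") is an assumption; the real files may store the IDs differently.

```python
import ast
import pandas as pd

# Stand-ins for movies_overview.csv and movies_genres.csv.
movies = pd.DataFrame({
    "title": ["Example Movie"],
    "overview": ["A hero saves the day."],
    "genre_ids": ["[28, 12]"],  # assumed stringified list of IDs
})
genres = pd.DataFrame({"id": [28, 12], "name": ["Action", "Adventure"]})

# Build the id -> name lookup, then resolve each movie's genre IDs.
id_to_name = dict(zip(genres["id"], genres["name"]))
movies["genres"] = movies["genre_ids"].apply(
    lambda s: [id_to_name[i] for i in ast.literal_eval(s)]
)
print(movies[["title", "genres"]])
```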
Proposed NLP Task: Multi-Label Genre Classification

Objective: Create an NLP model that, given a movie’s overview, predicts the correct genre(s) for that movie. Since a movie may belong to multiple genres, this is a multi-label classification task.
Challenge Breakdown

a. Problem Statement
Participants are tasked with designing and training an NLP model that takes as input the movie overview text and outputs one or more genres. The challenge could encourage approaches that span from classical text classification methods (e.g., TF-IDF with logistic regression) to modern transformer-based models (e.g., BERT, RoBERTa).
b. Data Preprocessing

Text Cleaning & Tokenization: Clean the movie overviews (e.g., lowercasing, removing special characters) and tokenize the text.
Label Preparation: Transform the genre_ids into a multi-label format. Use movies_genres.csv to convert these IDs into genre names.
Data Splitting: Create training, validation, and test sets, ensuring the distribution of genres is well represented.

c. Baseline Models
Encourage participants to start with simple models (e.g., bag-of-words, TF-IDF combined with logistic regression or random forest) and progress towards deep learning approaches like LSTM-based networks or transformer models.
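As a concrete starting point, here is a sketch of a TF-IDF plus one-vs-rest logistic regression baseline with scikit-learn. The toy corpus and genre labels are invented for illustration; real data would come from the movies_overview file.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in corpus and multi-label genre annotations.
overviews = [
    "a detective hunts a killer through the city",
    "two friends fall in love on a road trip",
    "a detective falls in love while solving a case",
    "a spaceship crew battles an alien threat",
]
labels = [["Crime"], ["Romance"], ["Crime", "Romance"], ["Sci-Fi"]]

# One binary indicator column per genre.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(overviews)

# Binary relevance: one independent logistic regression per genre.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

pred = clf.predict(vectorizer.transform(["a killer is on the loose"]))
print(mlb.inverse_transform(pred))
```

The same pipeline shape carries over when swapping in stronger text encoders; only the feature extraction step changes.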
d. Evaluation Metrics
Since this is a multi-label task, consider evaluation metrics such as:
F1 Score (Macro / Micro): Balances precision and recall.
Hamming Loss: Measures how many labels are incorrectly predicted.
Subset Accuracy: For stricter evaluation (all labels must match exactly).

4. Additional Considerations

Baseline Code & Notebooks: Provide a starter notebook with initial data loading, preprocessing, and a simple baseline model. This helps lower the entry barrier for participants who may be new to multi-label NLP tasks.
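The multi-label metrics listed above (F1, Hamming loss, subset accuracy) can all be computed with scikit-learn on binary indicator matrices; the small matrices below are made-up values for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Rows = movies, columns = genres (binary indicator format).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

print("micro F1:", f1_score(y_true, y_pred, average="micro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
# Hamming loss: fraction of individual label slots that are wrong.
print("Hamming loss:", hamming_loss(y_true, y_pred))
# Subset accuracy: a row counts only if every label matches exactly.
print("subset accuracy:", accuracy_score(y_true, y_pred))
```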
Evaluation Server & Leaderboard: Ensure that your Kaggle competition setup allows for automatic evaluation using the selected metrics and that a public leaderboard is available for continuous feedback.
Documentation & Discussion: Include detailed documentation describing the datasets, the task requirements, and the evaluation procedure. Additionally, host a discussion forum to foster collaboration among participants.
Original Data Source: IMDb Movie Genre Classification Dataset