https://creativecommons.org/publicdomain/zero/1.0/
Title: 9,565 Top-Rated Movies Dataset
Description:
This dataset offers a comprehensive collection of 9,565 of the highest-rated movies according to audience ratings on the Movie Database (TMDb). The dataset includes detailed information about each movie, such as its title, overview, release date, popularity score, average vote, and vote count. It is designed to be a valuable resource for anyone interested in exploring trends in popular cinema, analyzing factors that contribute to a movie’s success, or building recommendation engines.
Key Features:
- Title: The official title of each movie.
- Overview: A brief synopsis or description of the movie's plot.
- Release Date: The release date of the movie, formatted as YYYY-MM-DD.
- Popularity: A score indicating the current popularity of the movie on TMDb, which can be used to gauge current interest.
- Vote Average: The average rating of the movie, based on user votes.
- Vote Count: The total number of votes the movie has received.
Data Source:
The data was sourced from the TMDb API, a well-regarded platform for movie information, using the /movie/top_rated endpoint. The dataset represents a snapshot of the highest-rated movies as of the time of data collection.
Data Collection Process:
- API Access: Data was retrieved programmatically using TMDb’s API.
- Pagination Handling: Multiple API requests were made to cover all pages of top-rated movies, ensuring the dataset’s comprehensiveness.
- Data Aggregation: Collected data was aggregated into a single, unified dataset using the pandas library.
- Cleaning: Basic data cleaning was performed to remove duplicates and handle missing or malformed data entries.
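The collection steps above can be sketched in a few lines of pandas. This is a minimal illustration, not the original collection script: the `fetch_page` helper is hypothetical and stands in for real requests to the TMDb /movie/top_rated endpoint (which require an API key).

```python
import pandas as pd

# Hypothetical page fetcher: in the real pipeline this would call the TMDb
# /movie/top_rated endpoint (e.g. via requests.get with an api_key).
def fetch_page(page):
    sample = {
        1: [{"id": 1, "title": "A", "vote_average": 8.7, "vote_count": 100},
            {"id": 2, "title": "B", "vote_average": 8.5, "vote_count": 90}],
        2: [{"id": 2, "title": "B", "vote_average": 8.5, "vote_count": 90},
            {"id": 3, "title": "C", "vote_average": None, "vote_count": 80}],
    }
    return sample.get(page, [])

# Pagination handling: walk pages until one comes back empty.
rows = []
page = 1
while True:
    results = fetch_page(page)
    if not results:
        break
    rows.extend(results)
    page += 1

# Aggregation and basic cleaning: one DataFrame, with duplicate ids and
# malformed rows (here, a missing rating) dropped.
df = (pd.DataFrame(rows)
        .drop_duplicates(subset="id")
        .dropna(subset=["vote_average"])
        .reset_index(drop=True))
```

The same drop-duplicates/drop-missing pattern scales unchanged to the full paginated crawl.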
Potential Uses:
- Trend Analysis: Analyze trends in movie ratings over time or compare ratings across different genres.
- Recommendation Systems: Build and train models to recommend movies based on user preferences.
- Sentiment Analysis: Perform text analysis on movie overviews to understand common themes and sentiments.
- Statistical Analysis: Explore the relationship between popularity, vote count, and average ratings.
Data Format: The dataset is provided in a structured tabular format (e.g., CSV), making it easy to load into data analysis tools like Python, R, or Excel.
Usage License: The dataset is shared under [appropriate license], ensuring that it can be used for educational, research, or commercial purposes, with proper attribution to the data source (TMDb).
https://creativecommons.org/publicdomain/zero/1.0/
Description: This dataset provides comprehensive movie statistics compiled from multiple sources, including Wikipedia, The Numbers, and IMDb. It offers a rich collection of information and insights into various aspects of movies, such as movie titles, production dates, genres, runtime minutes, director information, average ratings, number of votes, approval index, production budgets, domestic gross earnings, and worldwide gross earnings.
The dataset combines data scraped from Wikipedia, which includes details about movie titles, production dates, genres, runtime minutes, and director information, with data from The Numbers, a reliable source for box office statistics. Additionally, IMDb data is integrated to provide information on average ratings, number of votes, and other movie-related attributes.
With this dataset, users can analyze and explore trends in the film industry, assess the financial success of movies, identify popular genres, and investigate the relationship between average ratings and box office performance. Researchers, movie enthusiasts, and data analysts can leverage this dataset for various purposes, including data visualization, predictive modeling, and deeper understanding of the movie landscape.
Features:
- Movie_title
- Production_date
- Genres
- Runtime_minutes
- Director_name (primaryName)
- Director_professions (primaryProfession)
- Director_birthYear
- Director_deathYear
- Movie_averageRating: the average rating given by online users for a particular movie
- Movie_numberOfVotes: the number of votes given by online users for a particular movie
- Approval_Index: a normalized indicator (on a 0-10 scale) calculated by multiplying the logarithm of the number of votes by the average user rating. It provides a concise measure of a movie's overall popularity and approval among online viewers, penalizing both films that got too few reviews and blockbusters that got very many.
- Production_budget ($)
- Domestic_gross ($)
- Worldwide_gross ($)
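As a rough illustration of the Approval_Index idea described in the feature list, here is a hedged sketch. The reference vote count used for rescaling to 0-10 is an assumption; the dataset does not publish its exact normalization constant.

```python
import math

# Sketch of the Approval_Index: average user rating times the logarithm of
# the vote count, rescaled toward a 0-10 range. ref_votes (3,000,000) is an
# assumed rescaling constant, not the dataset's documented value.
def approval_index(avg_rating, n_votes, ref_votes=3_000_000):
    if n_votes < 1:
        return 0.0
    raw = avg_rating * math.log10(n_votes)   # unnormalized score
    scaled = raw / math.log10(ref_votes)     # ~0-10 for typical vote counts
    return min(10.0, scaled)
```

The log term is what penalizes thinly reviewed films: a 9.0-rated movie with 100 votes scores well below a 8.0-rated movie with 100,000 votes.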
Potential Applications:
- Box office analysis: Analyze the relationship between production budgets, domestic and worldwide gross earnings, and profitability.
- Genre analysis: Identify the most popular genres based on movie counts and analyze their performance.
- Rating analysis: Explore the relationship between average ratings, number of votes, and financial success.
- Director analysis: Investigate the impact of directors on movie ratings and financial performance.
- Time-based analysis: Study movie trends over different production years and observe changes in production budgets, box office earnings, and genre preferences.
By utilizing this dataset, users can gain valuable insights into the movie industry and uncover patterns that can inform decision-making, market research, and creative strategies.
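A minimal sketch of the box-office-analysis use case, using the column names from the feature list on a few hypothetical rows (the figures are illustrative, not from the dataset):

```python
import pandas as pd

# Toy rows using the dataset's documented column names; values are made up.
df = pd.DataFrame({
    "Movie_title": ["A", "B", "C"],
    "Genres": ["Drama", "Action", "Action"],
    "Production_budget": [10_000_000, 150_000_000, 60_000_000],
    "Worldwide_gross": [55_000_000, 450_000_000, 40_000_000],
})

# Profitability and return on investment per film.
df["Profit"] = df["Worldwide_gross"] - df["Production_budget"]
df["ROI"] = df["Profit"] / df["Production_budget"]

# Aggregate profit by genre, most profitable first.
by_genre = df.groupby("Genres")["Profit"].sum().sort_values(ascending=False)
```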
The IMDb movie review dataset consists of a balanced sample of 25,000 positive and 25,000 negative reviews, divided into equal-size train and test sets, with an average document length of 231 words.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains metadata for the top 10,000 most popular movies available on The Movie Database (TMDB). TMDB is a widely used online platform and community providing extensive details on films, TV shows, and related content. Users can browse and search for titles, accessing information such as cast, crew, synopses, and ratings. This dataset is designed for data analysts, researchers, and developers keen on examining movie popularity and attributes. It is a valuable resource for various analyses, including exploring trends in movie genres over time, identifying patterns in budget versus revenue, and evaluating the impact of different attributes on a film's popularity. The data was gathered from TMDB's public API and has undergone thorough cleaning and preprocessing to enhance its quality and usability.
This dataset comprises metadata for the top 10,000 most popular movies from The Movie Database. Specific numbers for rows or records beyond this top count are not available. The data has been meticulously crafted from raw information obtained via TMDB's public API and subsequently cleaned and preprocessed.
Ideal applications for this dataset include:
* Analysing trends in movie genres over time.
* Identifying correlations between movie budget, revenue, and popularity.
* Developing and testing movie recommendation systems.
* Exploring the impact of different attributes on a movie's success.
* Academic research into film industry dynamics and audience reception.
The dataset's geographic coverage is Global, reflecting the worldwide reach of movies and TMDB's user base. It focuses on the top 10,000 most popular movies, implying a snapshot of current or recent popularity without a specific historical time range for the films themselves. No specific demographic scope for the data is provided, but it reflects engagement from TMDB users generally.
CC0
This dataset is primarily intended for:
* Data Analysts: To scrutinise and analyse movie popularity and attributes.
* Researchers: For academic studies on film trends, audience behaviour, and industry patterns.
* Developers: To build and test applications such as movie recommendation engines or data visualisations.
Original Data Source: TMDB_top_rated_movies
This repository contains network graphs and network metadata from Moviegalaxies, a website providing network graph data for 773 films (1915–2012). The data includes individual network graph data in Graph Exchange XML Format and descriptive statistics on measures such as clustering coefficient, degree, density, diameter, modularity, average path length, the total number of edges, and the total number of nodes.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This is a dataset of the 10,000 most popular movies across the world, irrespective of language and recency. These have been extracted using the TMDb API.
What is TMDb's API? The closed-source API service is for people interested in using TMDb's movie, TV show, or actor images and/or data in their applications. TMDb's API is a system provided so that developers and their teams can programmatically fetch and use TMDb's data and/or images. The API is free to use as long as you attribute TMDb as the source of the data and/or images. TMDb also updates the API from time to time.
This dataset lists the 10,000 most popular movies across the globe. Information held inside the dataset:
A. Dataset 1: Movies dataset
1. title - Title of the movie in English.
2. overview - A small summary of the plot.
3. original_lang - Original language it was shot in.
4. rel_date - Date of release.
5. popularity - Popularity score.
6. vote_count - Votes received.
7. vote_average - Average of all votes received.
B. Dataset 2: Genres dataset
1. id
2. Movie ID
3. Genre
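Since genres live in a separate table keyed by movie ID, a typical first step is joining the two datasets. A sketch on hypothetical miniature tables; the exact join-key column names in the real files are assumptions here:

```python
import pandas as pd

# Miniature stand-ins for the two tables described above.
movies = pd.DataFrame({
    "id": [101, 102],
    "title": ["Movie X", "Movie Y"],
    "vote_average": [8.1, 7.4],
})
genres = pd.DataFrame({
    "movie_id": [101, 101, 102],
    "genre": ["Drama", "Thriller", "Comedy"],
})

# One row per (movie, genre) pair; a two-genre movie appears twice.
merged = movies.merge(genres, left_on="id", right_on="movie_id")
```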
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Movielens dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ayushimishra2809/movielens-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
GroupLens Research has collected and made available rating data sets from the MovieLens web site (http://movielens.org). The data sets were collected over various periods of time, depending on the size of the set. A recommender system is a simple algorithm whose aim is to provide the most relevant information to a user by discovering patterns in a dataset. The algorithm rates the items and shows the user the items that they would rate highly.
The data consists of 105,339 ratings applied over 10,329 movies. The average rating is 3.5, and the minimum and maximum ratings are 0.5 and 5 respectively. There are 668 users who have given their ratings for 149,532 movies.
Can you make a movie recommender system using any type of recommendation algorithm, such as content-based or collaborative filtering?
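As a starting point for that exercise, here is a minimal item-based collaborative-filtering sketch on toy ratings in the MovieLens (userId, movieId, rating) layout; the movie IDs and ratings below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy ratings table in the MovieLens layout; the real file has 105,339 rows.
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 10, 20, 30],
    "rating":  [5.0, 4.0, 1.0, 4.0, 5.0, 5.0, 4.5, 1.5],
})

# Item-based collaborative filtering: two movies are similar when the same
# users rate them similarly (cosine similarity over the user axis).
mat = ratings.pivot(index="movieId", columns="userId", values="rating").fillna(0)
norms = np.linalg.norm(mat.values, axis=1, keepdims=True)
sim = pd.DataFrame(mat.values @ mat.values.T / (norms @ norms.T),
                   index=mat.index, columns=mat.index)

def most_similar(movie_id):
    # Recommend the nearest neighbour, excluding the movie itself.
    return sim[movie_id].drop(movie_id).idxmax()
```

Content-based filtering would follow the same shape, but with the similarity computed over movie features (e.g. genres) instead of user co-ratings.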
--- Original source retains full ownership of the source dataset ---
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains details for 1262 Indonesian movies, compiled to offer insights into the country's film industry. It was assembled using an IMDb-Scraper and then converted and cleaned into a CSV file, providing a structured collection of movie information [1]. The data was collected from IMDb.com [1].
The dataset is provided in a CSV file format [1]. It includes 1262 unique movie records or rows [1, 2].
This dataset is ideal for: * Exploratory data analysis of Indonesian cinema trends [1]. * Natural Language Processing (NLP) tasks on movie descriptions [1]. * Analysing movie characteristics such as genre distribution, rating trends, and language prevalence. * Studying the impact of directors and actors within the Indonesian film landscape.
The dataset specifically covers Indonesian movies [1, 2]. The time range for these movies spans from 1926 to 2020 [2].
CC0
Original Data Source: IMDb Indonesian Movies
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Hollywood Theatrical Market Synopsis 1995 to 2021’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/johnharshith/hollywood-theatrical-market-synopsis-1995-to-2021 on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This Dataset contains the data of market analysis built on The Numbers unique categorization system, which uses 6 different criteria to identify a movie. All movies released since 1995 are categorized according to the following attributes: Creative type (factual, contemporary fiction, fantasy etc.), Source (book, play, original screenplay etc.), Genre (drama, horror, documentary etc.), MPAA rating, Production method (live action, digital animation etc.) and Distributor. In order to provide a fair comparison between movies released in different years, all rankings are based on ticket sales, which are calculated using average ticket prices announced by the MPAA in their annual state of the industry report.
The Dataset contains various files illustrating statistics such as annual ticket sales, highest grossers each year since 1995, top grossing creative types, top grossing distributors, top grossing genres, top grossing MPAA ratings, top grossing sources, top grossing production methods and the number of wide releases each year by various distributors.
The data was obtained from The Numbers website. Their theatrical market pages are based on the domestic theatrical market performance only. The domestic market is defined as the North American movie region (consisting of the United States, Canada, Puerto Rico and Guam). This data can be found from the website https://www.the-numbers.com/market/ with detailed analysis.
2020 and 2021 were rough years for the movie industry, and being a huge movie fanatic inspired me to share a dataset showing the exponential growth of box office collections as well as ticket sales over time (and the decline after 2020 due to the Covid-19 pandemic), indirectly indicating the quality of modern-day films. This dataset can also be used to study the genres that attract audiences the most, and to encourage one to create an amazing genre-specific plot and take one step closer to becoming the next most successful director!
--- Original source retains full ownership of the source dataset ---
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset offers a valuable corpus of film reviews in Spanish, specifically designed to support Natural Language Processing (NLP) research and development. In a field that often focuses heavily on the English language, this collection provides a much-needed resource for understanding natural language within the Spanish context. It comprises user-generated criticisms of over 50 highly relevant Spanish films, sourced from the Filmaffinity.com website. The aim is to foster knowledge sharing in Spanish NLP among users.
The dataset is structured in a tabular format, typically available as a CSV file. It contains reviews related to more than 50 Spanish films. Specific counts for rows or records are not provided; however, the file's delimiter is a double pipe "||".
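Because the delimiter is the two-character string "||", loading the file needs a parser that accepts multi-character separators; in pandas that means the python engine with a regex separator. The column names below are illustrative assumptions, not the file's documented schema:

```python
import io
import pandas as pd

# "||" is a multi-character separator, so pandas needs engine="python"
# (the default C engine only handles single-character delimiters) and the
# pipes must be escaped because the separator is treated as a regex.
sample = "film||review\nEl laberinto del fauno||Una obra maestra\n"
df = pd.read_csv(io.StringIO(sample), sep=r"\|\|", engine="python")
```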
This dataset is ideally suited for various applications in Natural Language Processing (NLP) focusing on the Spanish language. It can be used for: * Developing and testing NLP models for sentiment analysis on Spanish text. * Training machine learning models for text classification or topic modelling. * Learning and experimenting with NLP techniques using a real-world Spanish corpus. * Facilitating knowledge exchange and collaborative projects on Spanish NLP.
The dataset focuses exclusively on Spanish films and Spanish language reviews. The films included are those considered most relevant at the time the dataset was created, ensuring a relevant and current body of criticism from Filmaffinity.com users. There is no specified time range beyond the creation date for the included films.
CC0
This dataset is particularly beneficial for: * Spanish-speaking Kaggle users looking to contribute to and learn from NLP projects in their native language. * Researchers and students in artificial intelligence, linguistics, or data science focusing on NLP within the Spanish context. * Developers building applications that require understanding or processing Spanish text, especially in the entertainment or media sectors. * Anyone interested in analysing user-generated content and opinions on films in Spanish.
Original Data Source: Críticas películas filmaffinity en Español
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the raw data for the paper "Learning heterogeneous reaction kinetics from X-ray movies pixel-by-pixel". The MAT-file contains two variables: 1. 'stxm' is a structure array that contains all the STXM data. Each entry contains the STXM images scanned over one region, which may contain one or two particles. stxm contains the following fields:
- name: the name of the scanned region.
- scan: the scan number for all frames associated with this region.
- date: the date of the experiment.
- time: the time of the scan.
- lfpmat: the intensity of LiFePO4 (the variable 'a' in SI Eq. 112). The first two dimensions are image coordinates (x and y). The third dimension is the frame index, whose length is equal to the length of 'scan'.
- fpmat: the intensity of FePO4 (the variable 'b' in SI Eq. 112). The first two dimensions are image coordinates (x and y). The third dimension is the frame index, whose length is equal to the length of 'scan'.
- segment: a cell in which each entry is the frame indices associated with a charge or discharge half cycle.
- boundary: a cell in which the i-th entry is the image coordinates of the boundary of particle i in this region. The first and second columns are the x and y coordinates, respectively.
- roi: a cell in which the i-th entry is the region-of-interest (ROI) of particle i in this region. The ROI is a logical array in which 1 indicates a pixel inside the particle and 0 indicates a pixel outside the particle.
- Area: the area (in number of pixels) of the particles in this region.
- Centroid: the image coordinates of the centroids of the particles in this region. Each row corresponds to a particle. The first and second columns are the x and y coordinates, respectively.
- Orientation: the angle between the particles' major axis and the x-axis in degrees.
- MajorAxisLength: the length of the particles' major axis defined by the second moment of the ROI.
- MinorAxisLength: the length of the particles' minor axis defined by the second moment of the ROI.
- Crate: the global C-rate of the charge or discharge half cycle(s) measured for the entire cell. Its length is equal to the length of 'segment'.
- avg: a cell in which the i-th entry is the average Li fraction of particle i in all the frames.
- var: a cell in which the i-th entry is the variance of the Li fraction of particle i in all the frames.
- avgrate: the average local C-rate of the particles, defined as the change in average Li fraction over the duration of the half-cycle. Each row corresponds to a particle. Each column corresponds to a half-cycle. avgrate(i,j) is the average local C-rate of particle i during half-cycle j.
- inversion_Li_frac: the simulated Li fraction from the inversion result as shown in Fig. 2, SI Fig. 57, and SI Movie 1. inversion_Li_frac{i}{j} is the simulated Li fraction field of particle i during half-cycle j. The first two dimensions are image coordinates (x and y) (the sizes are the same as the first two dimensions of lfpmat and fpmat). The third dimension is the frame index, whose length is equal to the length of segment{j}. The value outside the ROI is NaN.
- inversion_k: the inverted heterogeneity k(x,y) as shown in Fig. 3b and Fig. 55. inversion_k{i} is the inverted k(x,y) of particle i. The two dimensions are image coordinates (x and y) (the sizes are the same as the first two dimensions of lfpmat and fpmat). The value outside the ROI is NaN.
2. 'aem' is a structure array that contains all the AEM data. Each entry contains the AEM image of a particle. aem contains the following fields:
- carbon: the AEM carbon signal I(x,y).
- name: the name of the scanned region that the particle is in.
- region: the index of the particle in the scanned region.
- augercp: coordinates of the control points in the AEM image. The first and second columns are the x and y coordinates, respectively.
- stxmcp: coordinates of the corresponding control points in the corresponding STXM image. The first and second columns are the x and y coordinates, respectively. 'augercp' and 'stxmcp' are used for image registration between AEM and STXM.
- auger2stxm: an affine2d object that determines the affine transformation for registration from AEM to STXM images. It is defined based on the control points.
- tree: the index in the 'stxm' structure array that this particle corresponds to.
- roi: the ROI of the particle.
MAD (Movie Audio Descriptions) is an automatically curated large-scale dataset for the task of natural language grounding in videos, or natural language moment retrieval. MAD exploits available audio descriptions of mainstream movies. Such audio descriptions are written for visually impaired audiences and are therefore highly descriptive of the visual content being displayed. MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video, and provides a unique setup for video grounding, as the visual stream is truly untrimmed: the average video duration is 110 minutes, two orders of magnitude longer than in legacy datasets.
Take a look at the paper for additional information.
From the authors on availability: "Due to copyright constraints, MAD’s videos will not be publicly released. However, we will provide all necessary features for our experiments’ reproducibility and promote future research in this direction"
https://creativecommons.org/publicdomain/zero/1.0/
Thank you for viewing my dataset; I look forward to seeing some code.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset published here was used to measure a high-resolution 3D waveform of isolated and reactivated axonemes from Chlamydomonas reinhardtii.
Note: This dataset contains a motion-blur correction applied to the data in doi: https://doi.org/10.1101/2024.03.18.585533 and code that details how the 3D average waveform was calculated.
It was further used to show twist-torsion coupling in these axonemes (doi:10.1038/s41567-025-02783-2).
The data is organized in seven folders:
1) High-resolution average 3D waveform of isolated and reactivated axonemes from Chlamydomonas reinhardtii.
Data files (MATLAB and txt format) contain the 3D coordinates (along the 3D arc-length) of 32 axonemal shapes that comprise one beat-cycle.
A corresponding txt file describes the details of the dataset.
2) 3D waveforms of single isolated and reactivated axonemes from Chlamydomonas reinhardtii.
Data files (MATLAB and txt format) contain the 3D shapes of 17 individual axonemes obtained from defocused darkfield-microscopy images.
A corresponding txt file describes the details of the dataset.
3) Image Raw Data of single isolated and reactivated axonemes used to reconstruct the 3D waveform
Movie files (multi-layer tif) of reactivated axonemes imaged with defocused-darkfield-microscopy.
A corresponding txt file describes the details of the dataset.
4) Calibration of defocused darkfield-microscopy.
Data file (MATLAB) contains the relationship between the z-position relative to the focal plane and the full-width-at-half-maximum (FWHM) of the axoneme signal, measured normal to the centerline, as well as the z-stack of images (multi-layer tif) used to extract this relation.
A corresponding txt file describes the details of the dataset.
5) Distance between gold nanoparticle (GNP) and the axonemal centerline as a function of the beat cycle
Data file (MATLAB) contains 20 measurements of d_C (where d_C is the normal distance between the center position of the GNP and the axoneme centerline in 2D images) as a function of time. A corresponding txt file describes the details of the dataset.
6) Image Raw Data of single isolated and reactivated axonemes with attached GNPs used to measure d_C.
Movie files (multi-layer tif) of reactivated axonemes with attached gold nanoparticles (GNPs) imaged with darkfield-microscopy.
A corresponding txt file describes the details of the dataset.
7) Code to calculate the average 3D waveform from defocused darkfield movies
MATLAB code used to calculate the average waveform of an axoneme that was recorded with high-speed defocused darkfield microscopy.
A corresponding pdf file (Manual.pdf) describes the details of the procedure in 6 steps.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Normalization
# Generate a resting state (rs) timeseries (ts)
# Install / load package to make fake fMRI ts
# install.packages("neuRosim")
library(neuRosim)
# Generate a ts
ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1)
# 3dDetrend -normalize
# R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1"
# Do for the full timeseries
ts.normalised.long <- (ts.rs-mean(ts.rs))/sqrt(sum((ts.rs-mean(ts.rs))^2));
# Do this again for a shorter version of the same timeseries
ts.shorter.length <- length(ts.normalised.long)/4
ts.normalised.short <- (ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))/sqrt(sum((ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))^2));
# By looking at the summaries, it can be seen that the median values become larger
summary(ts.normalised.long)
summary(ts.normalised.short)
# Plot results for the long and short ts
# Truncate the longer ts for plotting only
ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length]
# Give the plot a title
title <- "3dDetrend -normalize for long (blue) and short (red) timeseries";
plot(x=0, y=0, main=title, xlab="", ylab="", xaxs='i', xlim=c(1,length(ts.normalised.short)), ylim=c(min(ts.normalised.short),max(ts.normalised.short)));
# Add zero line
lines(x=c(-1,ts.shorter.length), y=rep(0,2), col='grey');
# 3dDetrend -normalize -polort 0 for long timeseries
lines(ts.normalised.long.made.shorter, col='blue');
# 3dDetrend -normalize -polort 0 for short timeseries
lines(ts.normalised.short, col='red');
Standardization/modernization
New afni_proc.py command line
afni_proc.py \
-subj_id "$sub_id_name_1" \
-blocks despike tshift align tlrc volreg mask blur scale regress \
-radial_correlate_blocks tcat volreg \
-copy_anat anatomical_warped/anatSS.1.nii.gz \
-anat_has_skull no \
-anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \
-anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
-anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
-anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \
-anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \
-anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \
-anat_follower_erode fsvent fswm \
-dsets media_?.nii.gz \
-tcat_remove_first_trs 8 \
-tshift_opts_ts -tpattern alt+z2 \
-align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
-tlrc_base "$basedset" \
-tlrc_NL_warp \
-tlrc_NL_warped_dsets \
anatomical_warped/anatQQ.1.nii.gz \
anatomical_warped/anatQQ.1.aff12.1D \
anatomical_warped/anatQQ.1_WARP.nii.gz \
-volreg_align_to MIN_OUTLIER \
-volreg_post_vr_allin yes \
-volreg_pvra_base_index MIN_OUTLIER \
-volreg_align_e2a \
-volreg_tlrc_warp \
-mask_opts_automask -clfrac 0.10 \
-mask_epi_anat yes \
-blur_to_fwhm -blur_size $blur \
-regress_motion_per_run \
-regress_ROI_PC fsvent 3 \
-regress_ROI_PC_per_run fsvent \
-regress_make_corr_vols aeseg fsvent \
-regress_anaticor_fast \
-regress_anaticor_label fswm \
-regress_censor_motion 0.3 \
-regress_censor_outliers 0.1 \
-regress_apply_mot_types demean deriv \
-regress_est_blur_epits \
-regress_est_blur_errts \
-regress_run_clustsim no \
-regress_polort 2 \
-regress_bandpass 0.01 1 \
-html_review_style pythonic
We used similar command lines to generate the 'blurred and not censored' and the 'not blurred and not censored' timeseries files (described more fully below). We will make the code used to generate all derivative files available on our github site (https://github.com/lab-lab/nndb).

We made one choice above that is different enough from our original pipeline that it is worth mentioning here. Specifically, we have quite long runs, averaging ~40 minutes but variable in length (thus leading to the above issue with 3dDetrend's -normalize). A discussion on the AFNI message board with one of our team (starting here: https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256) led to the suggestion that '-regress_polort 2' with '-regress_bandpass 0.01 1' be used for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach. Our new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.
Which timeseries file you use is up to you, but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul's own words:
* Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice).
* Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere).
* For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data).
* For censored data:
  * Performing ISC requires the users to unionize the censoring patterns during the correlation calculation.
  * If wanting to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might still do for naturalistic tasks), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC params) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data.
In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.
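The censoring caveat for ISC can be sketched as follows: correlate two subjects only over timepoints that survive censoring in both subjects. This is a toy numpy illustration under assumed names and synthetic data, not our pipeline code:

```python
import numpy as np

# Hypothetical ISC helper: correlate two subjects' timeseries only over
# timepoints kept (uncensored) in BOTH subjects, i.e. combine the two
# censoring patterns before computing the correlation.
def isc_censored(ts_a, ts_b, keep_a, keep_b):
    keep = keep_a & keep_b               # timepoint survives in both subjects
    return np.corrcoef(ts_a[keep], ts_b[keep])[0, 1]

# Synthetic demo: a shared signal plus subject-specific noise, with ~5% of
# timepoints censored per subject.
rng = np.random.default_rng(0)
sig = rng.standard_normal(200)
ts_a = sig + 0.1 * rng.standard_normal(200)
ts_b = sig + 0.1 * rng.standard_normal(200)
keep_a = rng.random(200) > 0.05
keep_b = rng.random(200) > 0.05
r = isc_censored(ts_a, ts_b, keep_a, keep_b)
```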
Effect on results
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
This table contains 420 series, with data for years 1996/1997 - 2004/2005 (not all combinations necessarily have data for all years), and is no longer being released. This table contains data described by the following dimensions (Not all combinations are available): Geography (12 items: Canada; Newfoundland and Labrador; Prince Edward Island; Nova Scotia; ...), Type of venue (3 items: Total movie theatres and drive-ins; Movie theatres; Drive-ins), Summary characteristics (14 items: Number of theatres; Paid admissions; Average ticket prices; Number of screens; ...).
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Representative time-lapse movie of a normal mouse mammary fragment in collagen I. CIL 42168 is a related movie of a normal mammary fragment in Matrigel. Images taken every 20 min. This movie is part of a group of movies that include CIL 42151-42168.
We introduce InfiniBench, a comprehensive benchmark for very long video understanding, which presents 1) the longest video duration, averaging 76.34 minutes; 2) the largest number of question-answer pairs, 108.2K; 3) diversity in questions that examine nine different skills and include both multiple-choice and open-ended questions; 4) a human-centric design, as the video sources come from movies and daily TV shows, with human-level question designs such as movie-spoiler questions that require critical thinking and comprehensive understanding. Using InfiniBench, we comprehensively evaluate existing Large Multi-Modality Models (LMMs) on each skill, including the commercial model Gemini 1.5 Flash and open-source models. The evaluation shows significant challenges in our benchmark. Our results show that even the best AI models, such as Gemini, struggle to perform well, with 42.72% average accuracy and an average score of 2.71 out of 5.
https://choosealicense.com/licenses/unknown/
Dataset Card for [Dataset Name]
Dataset Summary
The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/sst2.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The moviesAnalyzed.csv file is a comma-separated-value file with the data used in Ghirlanda S, Acerbi A, Herzog H, "Dog movie stars and dog breed popularity," currently under review at Proceedings of the Royal Society of London, B. The columns in the file have the meanings given below. When a piece of information was not found or cannot be computed, it is given as NA (see paper for possible reasons).
- dog: name of the dog actor
- breed: the portrayed dog's breed
- year: the year of movie release
- title: the movie title
- earnings1: movie earnings during the opening weekend (in 2012 USD)
- earnings: total movie earnings (in 2012 USD)
- disney: whether the movie was produced by the Walt Disney Company
- before[n]: the n-year popularity trend of the considered breed before movie release
- after[n]: the n-year popularity trend of the considered breed after movie release
- popularity[n]: average number of registrations for the considered breed in the 2n+1 years around movie release (between n years before and n years after)
- effect[n]: the n-year effect of the movie on the breed's popularity trend
- excess[n]: registrations of the considered breed attributable to movie release (actual registrations over the n years after movie release minus registrations predicted based on the trend observed n years before movie release)
- viewers: estimated number of people who saw the movie
- viewers1: estimated number of people who saw the movie over its opening weekend
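As a quick usage sketch, the file can be loaded with pandas, treating NA as missing. The two sample rows below are invented placeholders that only mirror the column layout (a subset of the columns is shown); they are not values from the actual dataset.

```python
import io
import pandas as pd

# Made-up sample mimicking the layout of moviesAnalyzed.csv.
sample = io.StringIO(
    "dog,breed,year,title,earnings1,earnings,disney\n"
    "Rex,Collie,1995,Example Movie,1000000,5000000,0\n"
    "Fido,Beagle,2001,Another Movie,NA,NA,1\n"
)
df = pd.read_csv(sample, na_values=["NA"])  # NA = not found / not computable

# For earnings analyses, drop rows where earnings could not be found.
with_earnings = df.dropna(subset=["earnings"])
```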
https://creativecommons.org/publicdomain/zero/1.0/
Title: 9,565 Top-Rated Movies Dataset
Description:
This dataset offers a comprehensive collection of 9,565 of the highest-rated movies according to audience ratings on the Movie Database (TMDb). The dataset includes detailed information about each movie, such as its title, overview, release date, popularity score, average vote, and vote count. It is designed to be a valuable resource for anyone interested in exploring trends in popular cinema, analyzing factors that contribute to a movie’s success, or building recommendation engines.
Key Features:
- Title: The official title of each movie.
- Overview: A brief synopsis or description of the movie's plot.
- Release Date: The release date of the movie, formatted as YYYY-MM-DD.
- Popularity: A score indicating the current popularity of the movie on TMDb, which can be used to gauge current interest.
- Vote Average: The average rating of the movie, based on user votes.
- Vote Count: The total number of votes the movie has received.
Data Source:
The data was sourced from the TMDb API, a well-regarded platform for movie information, using the /movie/top_rated endpoint. The dataset represents a snapshot of the highest-rated movies as of the time of data collection.
Data Collection Process:
- API Access: Data was retrieved programmatically using TMDb’s API.
- Pagination Handling: Multiple API requests were made to cover all pages of top-rated movies, ensuring the dataset’s comprehensiveness.
- Data Aggregation: Collected data was aggregated into a single, unified dataset using the pandas library.
- Cleaning: Basic data cleaning was performed to remove duplicates and handle missing or malformed data entries.
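The four steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the original collection script: the endpoint and field names come from TMDb's public API, the API key is a placeholder, and fetch_page performs a live network call (so only the aggregation/cleaning step is exercised offline).

```python
import json
import urllib.parse
import urllib.request

import pandas as pd

API_KEY = "YOUR_TMDB_API_KEY"  # placeholder -- register at themoviedb.org
BASE = "https://api.themoviedb.org/3/movie/top_rated"
COLUMNS = ["title", "overview", "release_date", "popularity",
           "vote_average", "vote_count"]

def fetch_page(page):
    """Steps 1-2: retrieve one page of /movie/top_rated (network call)."""
    url = BASE + "?" + urllib.parse.urlencode({"api_key": API_KEY,
                                               "page": page})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def aggregate_pages(pages):
    """Steps 3-4: flatten the page payloads into one pandas DataFrame,
    keep the documented fields, and drop duplicate titles."""
    rows = [movie for page in pages for movie in page["results"]]
    df = pd.DataFrame(rows)[COLUMNS]
    return df.drop_duplicates(subset="title").reset_index(drop=True)
```

A full run would loop `fetch_page(1)`, `fetch_page(2)`, … up to the `total_pages` value in each response, then pass the collected payloads to `aggregate_pages`.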
Potential Uses:
- Trend Analysis: Analyze trends in movie ratings over time or compare ratings across different genres.
- Recommendation Systems: Build and train models to recommend movies based on user preferences.
- Sentiment Analysis: Perform text analysis on movie overviews to understand common themes and sentiments.
- Statistical Analysis: Explore the relationship between popularity, vote count, and average ratings.
Data Format: The dataset is provided in a structured tabular format (e.g., CSV), making it easy to load into data analysis tools like Python, R, or Excel.
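For example, the CSV loads directly into pandas, and the rating/popularity relationship mentioned above can be inspected with a correlation matrix. The three rows below are invented placeholders, not actual dataset values.

```python
import io
import pandas as pd

# Hypothetical CSV fragment mirroring the dataset's tabular layout.
csv_text = """title,release_date,popularity,vote_average,vote_count
The Example,1994-09-23,45.1,8.7,25000
Another Film,1972-03-14,38.9,8.7,19000
Third Picture,2008-07-16,60.2,8.5,31000
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["release_date"])

# Pairwise correlations between popularity, average rating, and vote count.
corr = df[["popularity", "vote_average", "vote_count"]].corr()
```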
Usage License: The dataset is shared under [appropriate license], ensuring that it can be used for educational, research, or commercial purposes, with proper attribution to the data source (TMDb).