19 datasets found
  1. Film Circulation dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Cite
    Loist, Skadi; Samoilova, Evgenia (Zhenya) (2024). Film Circulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7887671
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Film University Babelsberg KONRAD WOLF
    Authors
    Loist, Skadi; Samoilova, Evgenia (Zhenya)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS – European Journal of Media Studies, an open-access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.

    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook, and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information on whether festival run information is available through the IMDb data.
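
    As a quick orientation, here is a minimal pandas sketch of how the long and wide tables relate. It is not part of the published dataset; the column names film_id, fest and year are assumptions based on the description above, and the actual file names may differ slightly.

        import pandas as pd

        film_long = pd.read_csv("1_film-dataset_festival-program_long.csv")
        film_wide = pd.read_csv("1_film-dataset_festival-program_wide.csv")

        # The long table may repeat a film (one row per sample festival); the wide
        # table lists each unique film once, so these two counts should agree.
        print(film_long["film_id"].nunique(), len(film_wide))

        # First sample festival per film, mirroring how the wide table's fest
        # variable is described above (column names are assumptions).
        first_fest = (film_long.sort_values("year")
                               .groupby("film_id")["fest"]
                               .first())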

    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is in the wide format, i.e. the information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on the number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to a crew member of a given film. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes eight text files containing the scripts for web scraping. They were written using R 3.6.3 for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, falling back to an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with the suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach based on two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may have typos or minor variations.
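
    For readers unfamiliar with these two measures, the following self-contained Python sketch illustrates them; it is not the authors' R code, and the bigram-based cosine similarity and the thresholds one would pick are assumptions for illustration only.

        from collections import Counter
        import math

        def bigrams(s):
            s = s.lower()
            return Counter(s[i:i + 2] for i in range(len(s) - 1))

        def cosine_similarity(a, b):
            # Cosine similarity over character-bigram counts.
            va, vb = bigrams(a), bigrams(b)
            dot = sum(va[g] * vb[g] for g in va)
            norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
            return dot / norm if norm else 0.0

        def osa_distance(a, b):
            # Optimal string alignment: edit distance plus adjacent transpositions.
            a, b = a.lower(), b.lower()
            d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
            for i in range(len(a) + 1):
                d[i][0] = i
            for j in range(len(b) + 1):
                d[0][j] = j
            for i in range(1, len(a) + 1):
                for j in range(1, len(b) + 1):
                    cost = 0 if a[i - 1] == b[j - 1] else 1
                    d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
                    if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                        d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
            return d[len(a)][len(b)]

        # A candidate IMDb title vs. a core-dataset title with a small typo.
        print(cosine_similarity("The Grand Budapest Hotel", "Grand Budapest Hotel"))
        print(osa_distance("Paterson", "Patterson"))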

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the number of missing values and errors.

    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definitions of variables (such as location, festival name, and festival categories), units of measurement, data sources, coding, and information on missing data.

    The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset is in wide format, i.e. all information for each festival is listed in one row. This

  2. Louisville Metro KY - Open Data Data Set Inventory Updated for 2022

    • datasets.ai
    • data.louisvilleky.gov
    • +4more
    Updated Apr 13, 2023
    + more versions
    Cite
    Louisville Metro Government (2023). Louisville Metro KY - Open Data Data Set Inventory Updated for 2022 [Dataset]. https://datasets.ai/datasets/louisville-metro-ky-open-data-data-set-inventory-updated-for-2022-286a7
    Explore at:
    Available download formats
    Dataset updated
    Apr 13, 2023
    Dataset authored and provided by
    Louisville Metro Government
    Area covered
    Louisville, Kentucky
    Description

    This data aligns with WWC Certification requirements, and serves as the basis for our data warehouse and open data roadmap. It's a continual work in progress across all departments.

    Louisville Metro Technology Services builds data and technology platforms to ready our government for our community’s digital future.

    Data Dictionary (Field Name: Description):

    Dataset Name: The official title of the dataset as listed in the inventory.
    Brief Description of Data: A short summary explaining the contents and purpose of the dataset.
    Data Source: The origin or system from which the data is collected or generated.
    Home Department: The primary department responsible for the dataset.
    Home Department Division: The specific division within the department that manages the dataset.
    Data Steward (Business) Name: The name of the person responsible for the dataset’s accuracy and relevance.
    Data Custodian (Technical) Name: The technical contact responsible for maintaining and managing the dataset infrastructure.
    Data Classification: The sensitivity level of the data (e.g., Public, Internal, Confidential).
    Data Format: The file format(s) in which the dataset is available (e.g., CSV, JSON, Shapefile).
    Frequency of Data Change: How often the dataset is updated (e.g., Daily, Weekly, Monthly, Annually).
    Time Span: The overall time period the dataset covers.
    Start Date: The beginning date of the data.

  3. Global scientific academies Dataset

    • scidb.cn
    Updated Nov 18, 2024
    Cite
    chen xiaoli (2024). Global scientific academies Dataset [Dataset]. http://doi.org/10.57760/sciencedb.14674
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Nov 18, 2024
    Dataset provided by
    Science Data Bank
    Authors
    chen xiaoli
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset was generated as part of a study aimed at profiling global scientific academies, which play a significant role in promoting scholarly communication and scientific progress. Below is a detailed description of the dataset.

    Data Generation Procedures and Tools: The dataset was compiled using a combination of web scraping, manual verification, and data integration from multiple sources, including Wikipedia categories, members of unions of scientific organizations, and web searches using specific query phrases (e.g., "country name + (academy OR society) AND site:.country code"). The records were enriched by cross-referencing data from the Wikidata API, the VIAF API, and the Research Organisation Registry (ROR). Additional manual curation ensured accuracy and consistency.

    Temporal and Geographical Scopes: The dataset covers scientific academies over a wide temporal scope, ranging from the 15th century to the present. The geographical scope includes academies from all continents, with emphasis on both developed and developing countries. The dataset aims to capture the full spectrum of scientific academies across different periods of historical development.

    Tabular Data Description: The dataset comprises a total of 301 academy records and 14,008 website navigation sections. Each row in the dataset represents a single scientific academy, while the columns describe attributes such as the academy’s name, founding date, location (city and country), website URL, email, and address.

    Missing Data: Although the dataset offers comprehensive coverage, some entries may have missing or incomplete fields. For instance, website navigation sections were not available for all records.

    Data Errors and Error Ranges: The data has been verified through manual curation, reducing the likelihood of errors. However, the use of crowd-sourced data from platforms like Wikipedia introduces potential risks of outdated or incomplete information. Any errors are likely minor and confined to fields such as navigation menu classifications, which may not fully reflect the breadth of an academy's activities.

    Data Files, Formats, and Sizes: The dataset is provided in CSV and JSON formats, ensuring compatibility with a wide range of software applications, including Microsoft Excel, Google Sheets, and programming languages such as Python (via libraries like pandas).

    This dataset provides a valuable resource for further research into the organizational behaviors, geographic distribution, and historical significance of scientific academies across the globe. It can be used for large-scale analyses, including comparative studies across different regions or time periods. Any feedback on the data is welcome; please contact the maintainer of the dataset.

    If you use the data, please cite the following paper: Xiaoli Chen and Xuezhao Wang. 2024. Profiling Global Scientific Academies. In The 2024 ACM/IEEE Joint Conference on Digital Libraries (JCDL ’24), December 16–20, 2024, Hong Kong, China. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3677389.3702582
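
    As a minimal illustration of loading the CSV release with pandas (as suggested above), here is a short sketch; the file name and the column names "country" and "founding_date" are assumptions, not confirmed field names from the record.

        import pandas as pd

        # The actual file name in the Science Data Bank record may differ.
        academies = pd.read_csv("global_scientific_academies.csv")

        # Quick look at coverage by country and founding date (assumed columns).
        print(academies.shape)
        print(academies["country"].value_counts().head(10))
        print(academies["founding_date"].min(), academies["founding_date"].max())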

  4. UNI-CEN Standardized Census Data Table - Census Division (CD) - 1986 - Long...

    • borealisdata.ca
    • dataone.org
    Updated Apr 4, 2023
    + more versions
    Cite
    UNI-CEN Project (2023). UNI-CEN Standardized Census Data Table - Census Division (CD) - 1986 - Long Format (DTA) (Version 2023-03) [Dataset]. http://doi.org/10.5683/SP3/QVOT0Y
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 4, 2023
    Dataset provided by
    Borealis
    Authors
    UNI-CEN Project
    License

    https://borealisdata.ca/api/datasets/:persistentId/versions/4.0/customlicense?persistentId=doi:10.5683/SP3/QVOT0Y

    Time period covered
    Jan 1, 1986
    Area covered
    Canada
    Description

    UNI-CEN Standardized Census Data Tables contain Census data that have been reformatted into a common table format with standardized variable names and codes. The data are provided in two tabular formats for different use cases. "Long" tables are suitable for use in statistical environments, while "wide" tables are commonly used in GIS environments. The long tables are provided in Stata Binary (dta) format, which is readable by all statistics software. The wide tables are provided in comma-separated values (csv) and dBase 3 (dbf) formats with codebooks. The wide tables are easily joined to the UNI-CEN Digital Boundary Files. For the csv files, a .csvt file is provided to ensure that column data formats are correctly formatted when importing into QGIS. A schema.ini file does the same when importing into ArcGIS environments. As the DBF file format supports a maximum of 250 columns, tables with a larger number of variables are divided into multiple DBF files. For more information about file sources, the methods used to create them, and how to use them, consult the documentation at https://borealisdata.ca/dataverse/unicen_docs. For more information about the project, visit https://observatory.uwo.ca/unicen.
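
    As a minimal sketch of using the two table formats in Python, the following assumes placeholder file names (the actual names in the record differ); pandas reads both the Stata long tables and the csv wide tables directly.

        import pandas as pd

        # "Long" table in Stata Binary (dta) format, suitable for statistical work.
        long_table = pd.read_stata("unicen_cd_1986_long.dta")  # hypothetical file name
        print(long_table.head())

        # "Wide" table in csv format; in a GIS this would be joined to the
        # UNI-CEN Digital Boundary Files, but it loads the same way here.
        wide_table = pd.read_csv("unicen_cd_1986_wide.csv")  # hypothetical file name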

  5. Japanese Chain of Thought Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Japanese Chain of Thought Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/japanese-chain-of-thought-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Welcome to the Japanese Chain of Thought prompt-response dataset, a meticulously curated collection containing 3000 comprehensive prompt and response pairs. This dataset is an invaluable resource for training Language Models (LMs) to generate well-reasoned answers and minimize inaccuracies. Its primary utility lies in enhancing LLMs' reasoning skills for solving arithmetic, common sense, symbolic reasoning, and complex problems.

    Dataset Content

    This COT dataset comprises a diverse set of instructions and questions paired with corresponding answers and rationales in the Japanese language. These prompts and completions cover a broad range of topics and questions, including mathematical concepts, common sense reasoning, complex problem-solving, scientific inquiries, puzzles, and more.

    Each prompt is meticulously accompanied by a response and rationale, providing essential information and insights to enhance the language model training process. These prompts, completions, and rationales were manually curated by native Japanese people, drawing references from various sources, including open-source datasets, news articles, websites, and other reliable references.

    Our chain-of-thought prompt-completion dataset includes various prompt types, such as instructional prompts, continuations, and in-context learning (zero-shot, few-shot) prompts. Additionally, the dataset contains prompts and completions enriched with various forms of rich text, such as lists, tables, code snippets, JSON, and more, with proper markdown format.

    Prompt Diversity

    To ensure a wide-ranging dataset, we have included prompts from a plethora of topics related to mathematics, common sense reasoning, and symbolic reasoning. These topics encompass arithmetic, percentages, ratios, geometry, analogies, spatial reasoning, temporal reasoning, logic puzzles, patterns, and sequences, among others.

    These prompts vary in complexity, spanning easy, medium, and hard levels. Various question types are included, such as multiple-choice, direct queries, and true/false assessments.

    Response Formats

    To accommodate diverse learning experiences, our dataset incorporates different types of answers depending on the prompt and provides step-by-step rationales. The detailed rationale aids the language model in building a reasoning process for complex questions.

    These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details

    This fully labeled Japanese Chain of Thought Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt complexity, prompt category, domain, response, rationale, response type, and rich text presence.
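
    A minimal sketch of inspecting the JSON release is shown below. The file name and the field keys (e.g. "prompt_type", "prompt_complexity") are assumptions based on the annotation details listed above, not confirmed keys from the delivered files.

        import json
        from collections import Counter

        with open("japanese_cot_dataset.json", encoding="utf-8") as f:  # hypothetical file name
            records = json.load(f)

        # Distribution of prompt types and complexity levels (assumed keys).
        print(Counter(r["prompt_type"] for r in records))
        print(Counter(r["prompt_complexity"] for r in records))

        # One full example: prompt, response, and step-by-step rationale.
        sample = records[0]
        print(sample["prompt"], sample["response"], sample["rationale"], sep="\n---\n")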

    Quality and Accuracy

    Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses and rationales are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.

    The Japanese version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

    Continuous Updates and Customization

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom chain of thought prompt completion data tailored to specific needs, providing flexibility and customization options.

    License

    The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Japanese Chain of Thought Prompt Completion Dataset to enhance the rationale and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.

  6. Dog Breeds

    • kaggle.com
    zip
    Updated Sep 22, 2023
    Cite
    Sujay Kapadnis (2023). Dog Breeds [Dataset]. https://www.kaggle.com/datasets/sujaykapadnis/dog-breeds/code
    Explore at:
    Available download formats: zip (27212 bytes)
    Dataset updated
    Sep 22, 2023
    Authors
    Sujay Kapadnis
    Description

    The data comes from the American Kennel Club. It includes: breed_traits (trait information on each dog breed and scores for each trait, wide format), trait_description (long descriptions of each trait and the values corresponding to Trait_Score), and breed_rank_all (popularity of dog breeds by AKC registration statistics from 2013-2020).

    Data Dictionary

    breed_traits_long.csv

    variable | class | description
    Breed | character | Dog Breed
    Trait | character | Name of trait/characteristic
    Trait_Score | character | Placement on scale of 1-5 for the trait, with the exception of a description for coat type and length

    breed_traits.csv

    variable | class | description
    Breed | character | Dog Breed
    Affectionate With Family | character | Placement on scale of 1-5 for the breed's tendency to be "Affectionate With Family" (Trait_Score)
    Good With Young Children | character | Placement on scale of 1-5 for the breed's tendency to be "Good With Young Children" (Trait_Score)
    Good With Other Dogs | character | Placement on scale of 1-5 for the breed's tendency to be "Good With Other Dogs" (Trait_Score)
    Shedding Level | character | Placement on scale of 1-5 for the breed's "Shedding Level" (Trait_Score)
    Coat Grooming Frequency | character | Placement on scale of 1-5 for the breed's "Coat Grooming Frequency" (Trait_Score)
    Drooling Level | character | Placement on scale of 1-5 for the breed's "Drooling Level" (Trait_Score)
    Coat Type | character | Description of the breed's coat type (Trait_Score)
    Coat Length | character | Description of the breed's coat length (Trait_Score)
    Openness To Strangers | character | Placement on scale of 1-5 for the breed's tendency to be open to strangers (Trait_Score)
    Playfulness Level | character | Placement on scale of 1-5 for the breed's tendency to be playful (Trait_Score)
    Watchdog/Protective Nature | character | Placement on scale of 1-5 for the breed's "Watchdog/Protective Nature" (Trait_Score)
    Adaptability Level | character | Placement on scale of 1-5 for the breed's tendency to be adaptable (Trait_Score)
    Trainability Level | character | Placement on scale of 1-5 for the breed's tendency to be trainable (Trait_Score)
    Energy Level | character | Placement on scale of 1-5 for the breed's "Energy Level" (Trait_Score)
    Barking Level | character | Placement on scale of 1-5 for the breed's "Barking Level" (Trait_Score)
    Mental Stimulation Needs | character | Placement on scale of 1-5 for the breed's "Mental Stimulation Needs" (Trait_Score)

    trait_description.csv

    variable | class | description
    Trait | character | Name of trait
    Trait_1 | character | Value corresponding to Trait when Trait_Score = 1
    Trait_5 | character | Value corresponding to Trait when Trait_Score = 5
    Description | character | Long description of trait

    breed_rank_all.csv

    variable | class | description
    Breed | character | Dog Breed
    2013 Rank | character | Popularity of breed based on AKC registration statistics in 2013
    2014 Rank | character | Popularity of breed based on AKC registration statistics in 2014
    2015 Rank | character | Popularity of breed based on AKC registration statistics in 2015
    2016 Rank | character | Popularity of breed based on AKC registration statistics in 2016
    2017 Rank | character | Popularity of breed based on AKC registration statistics in 2017
    2018 Rank | character | Popularity of breed based on AKC registration statistics in 2018
    2019 Rank | character | Popularity of breed based on AKC registration statistics in 2019
    2020 Rank | character | Popularity of breed based on AKC registration statistics in 2020
    links | character | Link to the dog breed's AKC webpage
    Image | character | Link to image of dog breed
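
    As an illustration of how these tables fit together, here is a minimal pandas sketch joining the trait scores to the 2020 popularity ranks; it assumes the csv files are named exactly as in the data dictionary above.

        import pandas as pd

        traits = pd.read_csv("breed_traits.csv")
        ranks = pd.read_csv("breed_rank_all.csv")

        # Join on the shared Breed column and compare one trait against 2020 popularity.
        merged = traits.merge(ranks[["Breed", "2020 Rank"]], on="Breed", how="inner")
        merged["2020 Rank"] = pd.to_numeric(merged["2020 Rank"], errors="coerce")
        print(merged[["Breed", "Affectionate With Family", "2020 Rank"]]
              .sort_values("2020 Rank")
              .head(10))
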
  7. AI in Business – EU contextual regulations & future workforce survey data...

    • zenodo.org
    Updated Sep 1, 2025
    Cite
    Antonio Clim; Antonio Clim (2025). AI in Business – EU contextual regulations & future workforce survey data (dataset) [Dataset]. http://doi.org/10.5281/zenodo.17022049
    Explore at:
    Dataset updated
    Sep 1, 2025
    Dataset provided by
    Zenodo
    Authors
    Antonio Clim; Antonio Clim
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Nov 15, 2023
    Description
    DATASET OVERVIEW

    This dataset is a CSV file containing responses from an online survey titled “AI in Business – EU Contextual Regulations and Future Workforce Impacts.” The survey ran from late 2023 through 2025, gathering over 300 responses from participants (primarily in Romania and other EU regions). Each row in the CSV represents one respondent’s answers. Questions span multiple-choice and Likert scale responses, divided into three sections: (A) demographics and background, (B) perceptions of AI in business and relevant EU regulations, and (C) motivations and experiences with AI adoption in business. The questionnaire was provided in both English and Romanian to maximize clarity and participation in a bilingual context[1]. Table 1 below summarizes the structure of the dataset, listing each field (column) and its contents or question topic.

    Table 1: Survey Dataset Structure

    Timestamp: Date and time when the response was submitted.
    Email address: Email address of the respondent.
    Age group: Respondent's age group (e.g., 18–24, 25–34, etc.).
    Gender: Gender of the respondent (Female, Male, Prefer not to say).
    Highest level of education: Highest educational attainment of the respondent (High school or less; Bachelor’s/Master’s; Doctorate or above; Vocational; Prefer not to say).
    Current job role: Current job role or position of the respondent (Leadership/management; Professional/technical; Administrative/support; Sales/marketing; Student; Retired; Unemployed).
    Sector: Industry sector in which the respondent currently works (e.g., Technology, Healthcare, Finance, Education, Manufacturing, Public, Retail, Other).
    Years of experience: Number of years the respondent has been working in their current field.
    AI knowledge: Self-rated knowledge and understanding of AI technology (No knowledge; Beginner; Intermediate; Advanced; Expert).
    Region: Region (in Romania or EU) where the respondent currently resides.
    AI familiarity: Familiarity with the concept of AI in a business context (1 = Not very familiar, 10 = Very familiar).
    AI improves operations: Belief that AI can significantly improve business operations (1 = Strongly disagree, 10 = Strongly agree).
    EU AI regulations too restrictive: Opinion on whether current EU regulations on AI in business are too restrictive (1 = Strongly disagree, 10 = Strongly agree).
    AI impact on job creation: Perceived impact of AI on job creation in the EU (1 = Not important, 10 = Significantly positive).
    AI job losses: Expectation that AI will lead to job losses in the EU (1 = Definitely will not, 10 = Definitely will).
    EU ethical AI leadership: Belief that the EU is well-positioned to lead globally in ethical AI practices (1 = Strongly disagree, 10 = Strongly agree).
    Data privacy regs effective: Perception of how effectively EU regulations address data privacy concerns in AI (1 = Very ineffectively, 10 = Very effectively).
    AI in decision-making stance: Stance on AI’s role in business decision-making processes (1 = Fully oppose, 10 = Fully support).
    Invest in AI education: Support for investing more in AI education and workforce training (1 = Definitely no, 10 = Definitely yes).
    AI creates new jobs: Belief that AI can lead to new types of jobs not existing today (1 = Strongly do not believe, 10 = Strongly believe).
    Transparency of AI processes: Perception of how transparent AI-driven processes are in businesses (1 = Completely opaque, 10 = Completely transparent).
    Confidence in AI data handling: Confidence in AI’s ability to ethically handle sensitive data (1 = Not confident at all, 10 = Very confident).
    AI widens SME gap: Belief that AI technologies will widen the gap between large corporations and SMEs (1 = Strongly disagree, 10 = Strongly agree).
    Future AI outlook (EU): Outlook on the future of AI in EU business over the next decade (1 = Extremely pessimistic, 5 = Extremely optimistic).
    AI integration cost perception: Perceived cost of integrating AI into operations for SMEs (1 = Very affordable, 10 = Extremely costly).
    Primary motivation for AI adoption: Primary motivation for adopting AI in the business (e.g., Improving efficiency; Customer experience; Competitive advantage; Cost reduction).
    AI helps navigate uncertainties: Belief that AI can help the business navigate market uncertainties more effectively (1 = Strongly disagree, 10 = Strongly agree).
    Awareness of EU AI regulations: Self-assessed understanding and awareness of EU AI regulations and their impact (1 = Not aware, 10 = Fully aware).
    Challenges in data access for AI: Whether the business has faced challenges in accessing or analyzing data for AI applications (Frequently; Occasionally; Rarely; Never; Not applicable).
    AI critical for SME future: Perceived criticality of AI in shaping future strategies and business models for SMEs (Extremely critical; Important; Somewhat important; Unimportant; Unsure).
    Plan to invest in AI (1-2 yrs): Plan to invest in AI technologies or tools in the next 1–2 years (Yes, definitely; Considering; Unsure; Unlikely; No plans).
    Support needed for AI adoption: Type of support or resources perceived to most help SMEs adopt AI effectively (Financial grants; Training resources; Government policies; Technical assistance).
    AI impact on job roles: Perceived impact of AI on employee roles in the business (Create new roles; Augment existing roles; Replace certain roles; No significant impact; Unsure).
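
    A minimal sketch of summarizing the 1–10 Likert items in the CSV is shown below. The file name and exact column headers are assumptions; the real CSV most likely uses the full question wording as headers.

        import pandas as pd

        survey = pd.read_csv("ai_in_business_survey.csv")  # hypothetical file name

        # Treat every numeric column as a Likert-type item and report its mean score.
        likert = survey.select_dtypes("number")
        print(likert.mean().sort_values(ascending=False))

        # Example cross-tab: AI knowledge level vs. plan to invest (assumed column names).
        print(pd.crosstab(survey["AI knowledge"], survey["Plan to invest in AI (1-2 yrs)"]))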

  8. Unified database of ozonesounding profiles

    • zenodo.org
    zip
    Updated Sep 12, 2025
    Cite
    Fabrizio Marra; Fabrizio Marra; FABIO MADONNA; FABIO MADONNA; Emanuele Tramutola; Emanuele Tramutola (2025). Unified database of ozonesounding profiles [Dataset]. http://doi.org/10.5281/zenodo.17094116
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 12, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Fabrizio Marra; Fabrizio Marra; FABIO MADONNA; FABIO MADONNA; Emanuele Tramutola; Emanuele Tramutola
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2025
    Description

    The unified database of ozonesounding profiles was obtained through the merging of three existing ozonesounding datasets, provided by the Southern Hemisphere Additional OZonesondes (SHADOZ), the Network for the Detection of Atmospheric Composition Change (NDACC), and the World Ozone and Ultraviolet Radiation Data Centre (WOUDC).

    Only a selected set of variables of interest, both data and metadata, was considered when building the unified dataset, due to the heterogeneous formats and varying levels of detail provided by each network, even when referring to measurements shared across different initiatives. These variables are listed in the following table.

    Standard name | Description | Unit
    idstation | The name of the station. | N.A.
    location_latitude | Latitude of station. | deg
    location_longitude | Longitude of station. | deg
    location_height | Height is defined as the altitude, elevation, or height of the defined platform + instrument above sea level. | m
    date_of_observation | Date when the ozonesonde was launched (in format yyyy-mm-dd hh:mm:ss with time zone). | N.A.
    time | Elapsed flight time since release. | s
    pressure | Atmospheric pressure of each level in Pascals. | Pa
    geop_alt | Geopotential height in meters. | m
    temperature | Air temperature in kelvin. | K
    relative_humidity | Relative humidity (dimensionless). | 1
    wind_speed | Wind speed in meters per second. | m/s
    wind_direction | Wind direction in degrees. | deg
    latitude | Observation latitude (during the flight). | deg
    longitude | Observation longitude (during the flight). | deg
    altitude | Height of sensor above local ground or sea surface. Positive values for above surface (e.g., sondes), negative for below (e.g., XBT). For visual observations, the height of the visual observing platform. | m (a.s.l.)
    sample_temperature | Temperature where the sample is measured, in kelvin. | K
    o3_partial_pressure | The level partial pressure of ozone in Pascals. | Pa
    ozone_concentration | The level mixing ratio of ozone in ppmv. | ppmv
    ozone_partial_pressure_total_uncertainty | Total uncertainty in the calculation of the ozone partial pressure as a composite of the individual uncertainty contributions. Uncertainties due to systematic bias are assumed to be random and to follow a normal distribution. The uncertainty calculation also accounts for the increased uncertainty incurred by homogenizing the data record. | Pa
    network | Source network of the profile. | N.A.
    type | Station classification flag. | N.A.
    filter_check | Profile quality control flag. | N.A.

    The dataset is organized into two main tables:

    • unified_header, which contains metadata associated with each ozonesounding profile (idstation, date_of_observation, location_latitude, location_longitude, location_height, network, type, filter_check);
    • unified_value, which includes the actual measurement data (idstation, date_of_observation, time, pressure, geop_alt, temperature, relative_humidity, wind_speed, wind_direction, latitude, longitude, altitude, sample_temperature, o3_partial_pressure, ozone_concentration, ozone_partial_pressure_total_uncertainty).

    To improve accessibility and performance, both tables are further subdivided into year-specific subtables, allowing for more efficient querying and data management across temporal ranges.

    Among the metadata variables included in the unified_header table, type and filter_check play a key role in characterizing the quality and coverage of the ozonesounding profiles. The type variable classifies each station based on the continuity of its time series: stations are grouped into Long Coverage (G), Medium Coverage (Y), or Short Coverage (R), depending on whether they provide at least one profile per month for at least 95% of the months in their time series, spanning:

    • ≥20 years for Long Coverage,
    • ≥10 and <20 years for Medium Coverage,
    • <10 years for Short Coverage.

    The filter_check variable is a quality control flag ranging from 0 to 3, summarizing the results of three structural checks applied to each profile: completeness of monthly coverage (at least three ascents per month), vertical coverage (reaching at least 10 hPa), and vertical resolution (minimum one data point every 100 meters). A higher filter_check value indicates better compliance with these criteria and, consequently, higher data reliability.
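
    A minimal sketch of selecting high-quality, long-coverage profiles from the unified_header table is given below. The file name and on-disk layout (one CSV per year-specific subtable) are assumptions; the column names follow the metadata variables listed above.

        import pandas as pd

        header = pd.read_csv("unified_header_2010.csv")  # hypothetical year-specific subtable

        # Long Coverage stations (type "G") whose profiles pass all three structural checks.
        good = header[(header["type"] == "G") & (header["filter_check"] == 3)]
        print(f"{len(good)} long-coverage profiles passing all three structural checks")

        # The matching measurement rows can then be pulled from the corresponding
        # unified_value subtable via idstation and date_of_observation.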

    Furthermore, an algorithm was implemented that is able to merge the different datasets by handling their different

  9. Data from: cigChannel: A large-scale 3D seismic dataset with labeled...

    • zenodo.org
    zip
    Updated Aug 1, 2025
    Cite
    Guangyu Wang; Xinming Wu; Wen Zhang; Guangyu Wang; Xinming Wu; Wen Zhang (2025). cigChannel: A large-scale 3D seismic dataset with labeled paleochannels for advancing deep learning in seismic interpretation [Dataset]. http://doi.org/10.5281/zenodo.11044512
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 1, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Guangyu Wang; Xinming Wu; Wen Zhang; Guangyu Wang; Xinming Wu; Wen Zhang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Assorted channel subset of the cigChannel dataset (V1.0)

    cigChannel (V1.0) is a dataset created by the Computational Interpretation Group (CIG) for deep-learning-based paleochannel interpretation in 3D seismic volumes. Guangyu Wang, Xinming Wu and Wen Zhang are the main contributors to the dataset.

    cigChannel (V1.0) contains 1,600 synthetic 3D seismic volumes with labels of meandering channels, tributary channel networks and submarine canyons. Seismic impedance and sedimentary facies volumes (the latter only for submarine canyons) that correspond to the seismic volumes are also included in this dataset. The components of this dataset are listed below:

    Subset name: Meandering channel
    Sample amount & size: 400, 256x256x256
    Contents: seismic volumes (float32); binary-class label volumes (uint8); seismic impedance volume (float32)
    Features: meandering channels; horizontal, inclined, folded and faulted structures; noise-free

    Subset name: Tributary channel network (formerly distributary channel)
    Sample amount & size: 400, 256x256x256
    Contents: seismic volumes (float32); binary-class label volumes (uint8); seismic impedance volume (float32)
    Features: tributary channel networks; horizontal, inclined, folded and faulted structures; noise-free

    Subset name: Submarine canyon (formerly submarine channel)
    Sample amount & size: 400, 256x256x256
    Contents: seismic volumes (float32); binary-class label volumes (uint8); seismic impedance volumes (float32); sedimentary facies volumes (int16)
    Features: submarine canyons; horizontal, inclined, folded and faulted structures; noise-free

    Subset name: Assorted channel
    Sample amount & size: 400, 256x256x256
    Contents: seismic volumes (float32); multi-class label volumes (int16); seismic impedance volumes (float32)
    Features: meandering channels, tributary channel networks and submarine canyons; horizontal, inclined, folded and faulted structures; noise-free

    Further details about this dataset are available in our paper published in Earth System Science Data:

    Wang, G., Wu, X., and Zhang, W.: cigChannel: a large-scale 3D seismic dataset with labeled paleochannels for advancing deep learning in seismic interpretation, Earth Syst. Sci. Data, 17, 3447–3471, https://doi.org/10.5194/essd-17-3447-2025, 2025.

    Due to the size limitation of the uploaded files, we have to publish the dataset in separate versions. This version includes the assorted channel subset, which contains the following zip files:

    • Assorted_Channel_Ip_xx-xx.zip: Seismic impedance volumes of sample No.xx to No.xx.
    • Assorted_Channel_Label_xx-xx.zip: Multi-class label volumes of sample No.xx to No.xx, where the value 0 represents the background (non-channel areas), 101 represents meandering channel No.1, 102 represents meandering channel No.2, etc., 201 represents distributary channel network No.1, 202 represents distributary channel network No.2, etc., and 301 represents the submarine channel (only one submarine channel per volume).
    • Assorted_Channel_Seismic_xx-xx.zip: Seismic (amplitude) volumes of sample No.xx to No.xx.
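
    A minimal numpy sketch of decoding the multi-class label encoding described above into channel types follows. It assumes the label volumes load as 256x256x256 int16 arrays; the file name, the loading step, and the upper bounds of the value ranges are placeholders, not part of the published dataset.

        import numpy as np

        label = np.load("assorted_channel_label_000.npy")  # hypothetical file name and format

        background = label == 0
        meandering = (label >= 101) & (label <= 199)   # 101 = meandering channel No.1, 102 = No.2, ...
        tributary = (label >= 201) & (label <= 299)    # 201 = distributary (tributary) network No.1, ...
        submarine = label == 301                       # only one submarine channel per volume

        print("voxels per class:",
              background.sum(), meandering.sum(), tributary.sum(), submarine.sum())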

    Samples in this subset feature different geologic structures:

    • Sample No. 0 to No. 49 feature horizontal structure.
    • Sample No. 50 to No. 99 feature inclined structure.
    • Sample No. 100 to No. 299 feature folded structure.
    • Sample No. 300 to No. 399 feature folded and faulted structures (uploaded as an expansion package).

    Portal to the expansion package: https://doi.org/10.5281/zenodo.15500696

    Portals to the other subsets:

  10. Harmonized global datasets of soil carbon and heterotrophic respiration from...

    • zenodo.org
    bin, nc, txt
    Updated Oct 7, 2025
    Cite
    Shoji Hashimoto; Shoji Hashimoto; Akihiko Ito; Akihiko Ito; Kazuya Nishina; Kazuya Nishina (2025). Harmonized global datasets of soil carbon and heterotrophic respiration from data-driven estimates, with derived turnover time and Q10 [Dataset]. http://doi.org/10.5281/zenodo.17282577
    Explore at:
    Available download formats: nc, txt, bin
    Dataset updated
    Oct 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shoji Hashimoto; Shoji Hashimoto; Akihiko Ito; Akihiko Ito; Kazuya Nishina; Kazuya Nishina
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We collected all available global soil carbon (C) and heterotrophic respiration (RH) maps derived from data-driven estimates, sourcing them from public repositories and supplementary materials of previous studies (Table 1). All spatial datasets were converted to NetCDF format for consistency and ease of use.

    Because the maps had varying spatial resolutions (ranging from 0.0083° to 0.5°), we harmonized all datasets to a common resolution of 0.5° (approximately 50 km at the equator). We then merged the processed maps by computing the mean, maximum, and minimum values at each grid cell, resulting in harmonized global maps of soil C (for the top 0–30 cm and 0–100 cm depths) and RH at 0.5° resolution.

    Grid cells with fewer than three soil C estimates or fewer than four RH estimates were assigned NA values. Land and water grid cells were automatically distinguished by combining multiple datasets containing soil C and RH information over land.

    Soil carbon turnover time (years), denoted as τ, was calculated under the assumption of a quasi-equilibrium state using the formula:

    τ = CS / RH

    where CS is soil carbon stock and RH is the heterotrophic respiration rate. The uncertainty range of τ was estimated for each grid cell using:

    τmax = CS⁺ / RH⁻  τmin = CS⁻ / RH⁺

    where CS⁺ and CS⁻ are the maximum and minimum soil C values, and RH⁺ and RH⁻ are the maximum and minimum RH values, respectively.

    To calculate the temperature sensitivity of decomposition (Q10)—the factor by which decomposition rates increase with a 10 °C rise in temperature—we followed the method described in Koven et al. (2017). The uncertainty of Q10 (maximum and minimum values) was derived using τmax and τmin, respectively.

    All files are provided in NetCDF format. The SOC file includes the following variables:
    · longitude, latitude
    · soc: mean soil C stock (kg C m⁻²)
    · soc_median: median soil C (kg C m⁻²)
    · soc_n: number of estimates per grid cell
    · soc_max, soc_min: maximum and minimum soil C (kg C m⁻²)
    · soc_max_id, soc_min_id: study IDs corresponding to the maximum and minimum values
    · soc_range: range of soil C values
    · soc_sd: standard deviation of soil C (kg C m⁻²)
    · soc_cv: coefficient of variation (%)
    The RH file includes:
    · longitude, latitude
    · rh: mean RH (g C m⁻² yr⁻¹)
    · rh_median, rh_n, rh_max, rh_min: as above
    · rh_max_id, rh_min_id: study IDs for max/min
    · rh_range, rh_sd, rh_cv: analogous variables for RH
    The mean, maximum, and minimum values of soil C turnover time are provided as separate files. The Q10 files contain estimates derived from the mean values of soil C and RH, along with associated uncertainty values.

    The harmonized dataset files available in the repository are as follows:

    · harmonized-RH-hdg.nc: global soil heterotrophic respiration map

    · harmonized-SOC100-hdg.nc: global soil C map for 0–100 cm

    · harmonized-SOC30-hdg.nc: global soil C map for 0–30 cm

    · Q10.nc: global Q10 map

    · Turnover-time_max.nc: global soil C turnover time estimated using maximum soil C and minimum RH

    · Turnover-time_min.nc: global soil C turnover time estimated using minimum soil C and maximum RH

    · Turnover-time_mean.nc: global soil C turnover time estimated using mean soil C and RH

    · Turnover-time30_mean.nc: global soil C turnover time estimated using the soil C map for 0-30 cm
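
    As a minimal xarray sketch, mean turnover time can be recomputed from the files named above via τ = CS / RH. The variable names and units follow the description (soc in kg C m⁻², rh in g C m⁻² yr⁻¹), so a factor of 1000 converts units; treat this as an illustrative assumption rather than the authors' exact procedure.

        import xarray as xr

        soc = xr.open_dataset("harmonized-SOC100-hdg.nc")["soc"]  # kg C m-2, 0-100 cm
        rh = xr.open_dataset("harmonized-RH-hdg.nc")["rh"]        # g C m-2 yr-1

        # Turnover time in years, assuming quasi-equilibrium (tau = CS / RH).
        tau_years = (soc * 1000.0) / rh
        print(float(tau_years.mean(skipna=True)))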

    Version history
    Version 1.1: Median values were added. Bug fix for SOC30 (n>2 was inactive in the former version)


    More details are provided in: Hashimoto, S., Ito, A. & Nishina, K. (in revision). Harmonized global soil carbon and respiration datasets with derived turnover time and temperature sensitivity. Scientific Data.

    Reference

    Koven, C. D., Hugelius, G., Lawrence, D. M. & Wieder, W. R. Higher climatological temperature sensitivity of soil carbon in cold than warm climates. Nat. Clim. Change 7, 817–822 (2017).

    Table 1: List of soil carbon and heterotrophic respiration datasets used in this study.

    Dataset | Repository/References (Dataset name) | Depth | ID in NetCDF file***
    Global soil C | Global soil data task 2000 (IGBP-DIS) [1] | 0–100 | 3, -
    Global soil C | Shangguan et al. 2014 (GSDE) [2,3] | 0–100, 0–30* | 1, 1
    Global soil C | Batjes 2016 (WISE30sec) [4,5] | 0–100, 0–30 | 6, 7
    Global soil C | Sanderman et al. 2017 (Soil-Carbon-Debt) [6,7] | 0–100, 0–30 | 5, 5
    Global soil C | Soilgrids team and Hengl et al. 2017 (SoilGrids) [8,9] | 0–30** | -, 6
    Global soil C | Hengl and Wheeler 2018 (LandGIS) [10] | 0–100, 0–30 | 4, 4
    Global soil C | FAO 2022 (GSOC) [11] | 0–30 | -, 2
    Global soil C | FAO 2023 (HWSD2) [12] | 0–100, 0–30 | 2, 3
    Circumpolar soil C | Hugelius et al. 2013 (NCSCD) [13-15] | 0–100, 0–30 | 7, 8
    Global RH | Hashimoto et al. 2015 [16,17] | - | 1
    Global RH | Warner et al. 2019 (Bond-Lamberty equation based) [18,19] | - | 2
    Global RH | Warner et al. 2019 (Subke equation based) [18,19] | - | 3
    Global RH | Tang et al. 2020 [20,21] | - | 4
    Global RH | Lu et al. 2021 [22,23] | - | 5
    Global RH | Stell et al. 2021 [24,25] | - |

  11. YouTube Comments Data

    • kaggle.com
    zip
    Updated Jun 2, 2025
    Cite
    Manjit Baishya (2025). YouTube Comments Data [Dataset]. https://www.kaggle.com/datasets/manjitbaishya001/youtube-comments-data
    Explore at:
    Available download formats: zip (2927759 bytes)
    Dataset updated
    Jun 2, 2025
    Authors
    Manjit Baishya
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    YouTube
    Description

    🧠 YouTube Comments Dataset: Doctor Mike vs 20 Anti-Vaxxers | Jubilee

    This dataset contains raw and labeled YouTube comments from the Jubilee video titled "Doctor Mike vs 20 Anti-Vaxxers | Surrounded" up to June 2, 2025. The comments reflect a wide range of opinions, sentiments, and arguments related to vaccination, medical science, and public discourse. This makes the dataset particularly valuable for Natural Language Processing (NLP) tasks in real-world, socially charged contexts.

    📌 Dataset Overview

    • Video Title: Doctor Mike vs 20 Anti-Vaxxers | Surrounded
    • Published by: Jubilee
    • Platform: YouTube
    • Collected up to: June 2, 2025
    • Language: English
    • Format: CSV or JSON (depending on upload)
    • Licensing: Public comments on a public platform (refer to YouTube Terms of Service for downstream usage)

    🧪 Key Use Cases

    This dataset is ideal for a wide range of NLP tasks:

    • 🧠 Sentiment Analysis: Classify user opinions into positive, negative, neutral, or irrelevant.
    • 🎯 Toxic Comment Classification: Detect hate speech, misinformation, and emotionally charged content.
    • 🧵 Argument Mining: Identify claims, premises, and conclusions in discussions.
    • 🗣️ Opinion Summarization: Summarize mass opinions from large-scale discourse.
    • 📊 Trend Analysis: Analyze shifts in public opinion regarding vaccines and healthcare narratives.
    • 🔍 Stance Detection: Determine the pro/anti stance of a comment regarding vaccination.
    • 🌐 Multi-label Classification: Assign multiple categories to a comment based on topic, tone, or belief.

    📁 Dataset Columns

    Column Name | Description
    text | Raw comment text

    💡 Why This Dataset?

    This dataset offers a real-world sample of social media discourse on a controversial and medically relevant topic. It includes both supportive and oppositional viewpoints and can help train robust, bias-aware NLP models.

    Because the video includes professional input from Doctor Mike and diverse opinions from 20 participants with strong anti-vaccine views, the comment section becomes a rich playground for studying digital rhetoric, misinformation, and science communication.

    🧰 Suggested Tasks

    • Binary or multi-class sentiment classification
    • Toxicity and hate speech detection
    • Conversational analysis
    • Keyword or entity extraction
    • Fine-tuning transformer models (e.g., BERT, RoBERTa)
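
    As a minimal baseline for the sentiment-classification task listed above, the following sketch trains a TF-IDF + logistic regression model on the "text" column. The file name and the label column ("sentiment") are assumptions; the listing only documents the raw text column, so labels may be provided separately in the labeled release.

        import pandas as pd
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import make_pipeline

        df = pd.read_csv("youtube_comments.csv").dropna(subset=["text"])  # hypothetical file name

        X_train, X_test, y_train, y_test = train_test_split(
            df["text"], df["sentiment"], test_size=0.2, random_state=42)  # "sentiment" is assumed

        model = make_pipeline(TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
                              LogisticRegression(max_iter=1000))
        model.fit(X_train, y_train)
        print("held-out accuracy:", model.score(X_test, y_test))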

    📎 A Note on Ethics

    Please use this dataset responsibly. Public comments may include misinformation or strong personal views. Consider including disclaimers or filters when using this data for deployment or educational use. Always be mindful of bias, representation, and the propagation of harmful narratives.

    🔗 Source Acknowledgment

    All comments are sourced from the publicly available YouTube comment section of Jubilee’s video. We are not affiliated with Jubilee or Doctor Mike.

  12. Data for A method for assessment of the general circulation model quality...

    • data.niaid.nih.gov
    • data.taltech.ee
    Updated Mar 10, 2021
    + more versions
    Cite
    Maljutenko, Ilja; Raudsepp, Urmas (2021). Data for A method for assessment of the general circulation model quality using k-means clustering algorithm [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4588509
    Explore at:
    Dataset updated
    Mar 10, 2021
    Dataset provided by
    Department of Marine Systems at Tallinn University of Technology
    Authors
    Maljutenko, Ilja; Raudsepp, Urmas
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset consists of simulated and observed salinity/temperature data which were used in the manuscript "A method for assessment of the general circulation model quality using k-means clustering algorithm" submitted to Geoscientific Model Development. The model simulation dataset is from long-term 3D circulation model simulation (Maljutenko and Raudsepp 2014, 2019). The observations are from the "Baltic Sea - Eutrophication and Acidity aggregated datasets 1902/2017 v2018" SMHI (2018).

    The files are in simple comma separated table format without headers. The Dout-t_z_lat_lon_Smod_Sobs_Tmod_Tobs.csv file contains columns with the following variables [units]: time [Matlab datenum units], vertical coordinate [m], latitude [°N], longitude [°E], model salinity [g/kg], observed salinity [g/kg], model temperature [°C], observed temperature [°C].

    The Dout-t_z_lat_lon_dS_dT_K1_K2_K3_K4_K5_K6_K7_K8_K9.csv file contains columns with the following variables [units]: the first four columns are the same as in the previous file, followed by salinity error [g/kg], temperature error [°C], and integer columns (K1–K9) showing the cluster to which the error pair is designated.

    do_clust_valid_DataFig.m is a Matlab script which reads the two csv files (and optionally the mask file Model_mask.mat), performs the clustering analysis, and creates the plots used in the manuscript. The script is organized into %% blocks which can be executed separately (default: ctrl+enter).

    The k-means function is used from the Matlab Statistics and Machine Learning Toolbox.
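
    For users without Matlab, a minimal Python sketch of the same clustering step with scikit-learn is shown below. The column positions follow the file description above (columns 5-6 hold the salinity and temperature errors), and the choice of nine clusters mirrors the K1...K9 columns; it is an assumption for illustration, not the authors' setting.

        import pandas as pd
        from sklearn.cluster import KMeans

        # Columns (no header): time, depth, lat, lon, dS, dT, K1...K9; cluster the
        # salinity/temperature error pairs (columns 5-6, i.e. zero-based 4:6).
        data = pd.read_csv("Dout-t_z_lat_lon_dS_dT_K1_K2_K3_K4_K5_K6_K7_K8_K9.csv", header=None)
        errors = data.iloc[:, 4:6].to_numpy()

        labels = KMeans(n_clusters=9, n_init=10, random_state=0).fit_predict(errors)
        print(pd.Series(labels).value_counts())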

    Additional software used in the do_clust_valid_DataFig.m:

    Author's auxiliary formatting scripts (in script/): datetick_cst.m, do_fitfig.m, do_skipticks.m, do_skipticks_y.m

    Colormaps are generated using cbrewer.m (Charles, 2021). Moving average smoothing is performed using nanmoving_average.m (Aguilera, 2021).

    References:

    Aguilera, C. A. V., 2021. moving_average v3.1 (Mar 2008) (https://www.mathworks.com/matlabcentral/fileexchange/12276-moving_average-v3-1-mar-2008), MATLAB Central File Exchange. Retrieved March 2, 2021.

    Charles, 2021. cbrewer : colorbrewer schemes for Matlab (https://www.mathworks.com/matlabcentral/fileexchange/34087-cbrewer-colorbrewer-schemes-for-matlab), MATLAB Central File Exchange. Retrieved March 2, 2021.

    Maljutenko, I., Raudsepp, U., 2019. Long-term mean, interannual and seasonal circulation in the Gulf of Finland—the wide salt wedge estuary or gulf type ROFI. Journal of Marine Systems, 195, pp.1-19. doi:10.1016/j.jmarsys.2019.03.004

    Maljutenko, I., Raudsepp, U., 2014. Validation of GETM model simulated long-term salinity fields in the pathway of saltwater transport in response to the Major Baltic Inflows in the Baltic Sea. Measuring and Modeling of Multi-Scale Interactions in the Marine Environment - IEEE/OES Baltic International Symposium 2014, BALTIC 2014, 6887830. doi:10.1109/BALTIC.2014.6887830

    SMHI 2018, Swedish Meteorological and Hydrological Institute (SMHI) (2018). Baltic Sea - Eutrophication and Acidity aggregated datasets 1902/2017 v2018. Aggregated datasets were generated in the framework of EMODnet Chemistry III, under the support of DG MARE Call for Tender EASME/EMFF/2016/006 - lot4. doi:10.6092/595D233C-3F8C-4497-8BD2-52725CEFF96B

  13. PSL Complete Dataset (2016-2025)

    • kaggle.com
    zip
    Updated Jul 3, 2025
    Cite
    Zeeshan Ahmad (2025). PSL Complete Dataset (2016-2025) [Dataset]. https://www.kaggle.com/datasets/zeeshanahmad124586/psl-complete-dataset-2016-2025
    Explore at:
    zip (768083 bytes)
    Dataset updated
    Jul 3, 2025
    Authors
    Zeeshan Ahmad
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📘 Context

    The Pakistan Super League (PSL) is a premier Twenty20 cricket league in Asia, established in 2015 by the Pakistan Cricket Board (PCB). It features six franchise teams representing major cities in Pakistan, competing in a round-robin format followed by playoffs and a grand final. Known for its high-octane action, world-class players, and massive fan following, PSL has become one of the most exciting T20 leagues globally.

    This dataset captures the complete history of PSL matches from its inception through March 2025, making it a valuable resource for cricket analysts, machine learning practitioners, sports journalists, and fans who want to dive deep into player and team performance trends.

    📦 Content

    • Geography: Pakistan, UAE (Asia)
    • Time Period: February 4, 2016 – March 18, 2025
    • Unit of Analysis: Ball-by-ball records of Pakistan Super League (PSL) matches

    📊 Variables

    The dataset includes ball-level data and match-level summaries, making it ideal for both high-level analytics and granular delivery-by-delivery insights.

    Column Name          Description
    id                   Unique identifier for each delivery
    match_id             Unique identifier for each match
    date                 Date of the match
    season               PSL season during which the match was played
    venue                Stadium where the match was held
    inning               Inning number (1 or 2)
    batting_team         Team currently batting
    bowling_team         Team currently bowling
    over                 Over number in the innings (0 to 19)
    ball                 Ball number within the over (1 to 6)
    batter               Name of the batsman facing the delivery
    bowler               Name of the bowler delivering the ball
    non_striker          Name of the non-striking batsman
    batsman_runs         Runs scored by the batter on that delivery
    extra_runs           Runs awarded as extras (wide, no-ball, etc.)
    total_runs           Total runs scored on the delivery (batsman + extras)
    extras_type          Type of extra run (e.g., wide, no-ball, bye)
    is_wicket            1 if a wicket fell on the delivery; 0 otherwise
    player_dismissed     Name of the player dismissed on the delivery (if any)
    dismissal_kind       Mode of dismissal (e.g., caught, bowled, run out)
    fielder              Name of the fielder involved in the dismissal (if applicable)
    winner               Team that won the match
    win_by               Margin of victory (e.g., "wickets 6", "runs 25")
    match_type           Stage of the match (e.g., league, eliminator, qualifier, final)
    player_of_match      Best-performing player of the match
    umpire_1             Name of the first on-field umpire
    umpire_2             Name of the second on-field umpire
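
    As a small usage illustration (the file name is an assumption; the column names are those listed above), the ball-by-ball rows can be aggregated into per-batter totals with pandas:

      # Hypothetical file name; columns follow the variable list above.
      import pandas as pd

      df = pd.read_csv("psl_ball_by_ball.csv")

      runs = df.groupby("batter")["batsman_runs"].sum()
      balls = df.groupby("batter").size()          # simplified: wides are not excluded here
      summary = (pd.DataFrame({"runs": runs, "balls": balls})
                   .assign(strike_rate=lambda d: 100 * d.runs / d.balls)
                   .sort_values("runs", ascending=False))
      print(summary.head(10))
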
  14. The ATLAS of Traffic Lights

    • zenodo.org
    mp4, zip
    Updated Feb 12, 2025
    Cite
    Rupert Polley; Nikolai Polley; Dominik Heid; Marc Heinrich; Sven Ochs; J. Marius Zöllner (2025). The ATLAS of Traffic Lights [Dataset]. http://doi.org/10.5281/zenodo.14775869
    Explore at:
    zip, mp4
    Dataset updated
    Feb 12, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rupert Polley; Nikolai Polley; Dominik Heid; Marc Heinrich; Sven Ochs; J. Marius Zöllner
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This is an older version: Please use the newest available version.

    Changelog:

    • 31. Jan. 2024: v0.1 - We released a small dataset sample. Until the full release on the 28. Feb. 2025, the annotation format may be subject to change.

    ATLAS


    ATLAS (Applied Traffic Light Annotation Set) is a new, publicly available dataset designed to improve traffic light detection for autonomous driving. Existing open-source datasets often omit certain traffic light states and lack camera configurations for near and far distances. To address this, ATLAS features over 33,000 images collected from three synchronized cameras (wide, medium, and tele) with varied fields of view in the German city of Karlsruhe. This setup captures traffic lights at diverse distances and angles, including difficult overhead views. Each of the dataset’s 72,998 bounding boxes is meticulously labeled with one of 25 unique pictogram-state classes, covering rare but critical states (e.g., red-yellow) and pictograms (straight-right, straight-left). Additional annotations include challenging conditions such as heavy rain. All data is anonymized using state-of-the-art tools. ATLAS provides a comprehensive, high-quality resource for robust traffic light detection, overcoming limitations of existing datasets.

    Camera         FOV [°]    Resolution     Images
    Front-Medium   61 × 39    1920 × 1200    25,158
    Front-Tele     31 × 20    1920 × 1200    5,109
    Front-Wide     106 × 92   2592 × 2048    2,777


    Directory Format:

    We provide the dataset in the following format:

    ATLAS
    ├── train
    │   ├── front_medium
    │   │   ├── images
    │   │   │   └── front_medium_1722622455-950002160.jpg
    │   │   └── labels
    │   │       └── front_medium_1722622455-950002160.txt
    │   ├── front_tele
    │   └── front_wide
    ├── test
    ├── ATLAS_classes.yaml
    ├── LICENSE
    └── README.md

    Annotation Format:

    Each line in an annotation file describes one bounding box using five fields (a short parsing sketch follows the list below):

    class_id x_center y_center width height

    1. class_id: An integer indicating the class of the annotated object. The file ATLAS_classes.yaml contains human-readable names corresponding to each numeric label.
    2. x_center, y_center: The normalized coordinates of the bounding box center, relative to the image dimensions (in the range [0,1]), where x_center is measured horizontally and y_center vertically.
    3. width, height: The normalized width and height of the bounding box, also expressed in the range [0,1]. These values are obtained by dividing the bounding box width and height in pixels by the overall image width and height, respectively.
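
    A minimal parsing sketch for this format (the label file name is taken from the directory listing above; the image size uses the Front-Medium resolution from the camera table; both are only examples):

      # Parse a YOLO-style ATLAS label file and convert normalized boxes to pixel corners.
      from pathlib import Path

      def load_boxes(label_path, img_w, img_h):
          boxes = []
          for line in Path(label_path).read_text().splitlines():
              class_id, xc, yc, w, h = line.split()
              xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
              x0 = (xc - w / 2) * img_w
              y0 = (yc - h / 2) * img_h
              x1 = (xc + w / 2) * img_w
              y1 = (yc + h / 2) * img_h
              boxes.append((int(class_id), x0, y0, x1, y1))
          return boxes

      boxes = load_boxes("front_medium_1722622455-950002160.txt", img_w=1920, img_h=1200)
      print(boxes[:3])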

    Terms and Conditions

    The ATLAS Dataset by FZI Research Center for Information Technology is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

    Therefore, the Dataset is only allowed to be used for non-commercial purposes, such as teaching and research. The Licensor thus grants the End User the right to use the dataset for its own internal and non-commercial use and the purpose of scientific research only. There may be inaccuracies, although the Licensor tried and will try its best to rectify any inaccuracy once found. We invite all users to report remarks via mail at polley@fzi.de

    If the dataset is used in media, a link to the Licensor’s website is to be included. In case the End User uses the dataset within research papers, the following publication should be quoted:

    Polley et al.: The ATLAS of Traffic Lights: A Reliable Perception Framework for Autonomous Driving (under review)

  15. Dataset of pdf files

    • kaggle.com
    Updated May 1, 2024
    Cite
    Manisha717 (2024). Dataset of pdf files [Dataset]. https://www.kaggle.com/datasets/manisha717/dataset-of-pdf-files
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 1, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Manisha717
    Description

    The dataset consists of diverse PDF files covering a wide range of topics. These files include reports, articles, manuals, and more, spanning various fields such as science, technology, history, literature, and business. With its broad content, the dataset offers versatility for testing and various purposes, making it valuable for researchers, developers, educators, and enthusiasts alike.

  16. t

    Data from: REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic...

    • researchdata.tuwien.ac.at
    txt, zip
    Updated Jul 15, 2025
    Cite
    Daniel Jan Sliwowski; Shail Jadav; Sergej Stanovcic; Jędrzej Orbik; Johannes Heidersberger; Dongheui Lee (2025). REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic Assembly and Disassembly [Dataset]. http://doi.org/10.48436/0ewrv-8cb44
    Explore at:
    zip, txt
    Dataset updated
    Jul 15, 2025
    Dataset provided by
    TU Wien
    Authors
    Daniel Jan Sliwowski; Shail Jadav; Sergej Stanovcic; Jędrzej Orbik; Johannes Heidersberger; Dongheui Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 9, 2025 - Jan 14, 2025
    Description

    REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic Assembly and Disassembly

    📋 Introduction

    Robotic manipulation remains a core challenge in robotics, particularly for contact-rich tasks such as industrial assembly and disassembly. Existing datasets have significantly advanced learning in manipulation but are primarily focused on simpler tasks like object rearrangement, falling short of capturing the complexity and physical dynamics involved in assembly and disassembly. To bridge this gap, we present REASSEMBLE (Robotic assEmbly disASSEMBLy datasEt), a new dataset designed specifically for contact-rich manipulation tasks. Built around the NIST Assembly Task Board 1 benchmark, REASSEMBLE includes four actions (pick, insert, remove, and place) involving 17 objects. The dataset contains 4,551 demonstrations, of which 4,035 were successful, spanning a total of 781 minutes. Our dataset features multi-modal sensor data including event cameras, force-torque sensors, microphones, and multi-view RGB cameras. This diverse dataset supports research in areas such as learning contact-rich manipulation, task condition identification, action segmentation, and more. We believe REASSEMBLE will be a valuable resource for advancing robotic manipulation in complex, real-world scenarios.

    ✨ Key Features

    • Multimodality: REASSEMBLE contains data from robot proprioception, RGB cameras, force-torque sensors, microphones, and event cameras.
    • Multitask labels: REASSEMBLE contains labels that enable research in Temporal Action Segmentation, Motion Policy Learning, Anomaly Detection, and Task Inversion.
    • Long horizon: Demonstrations in the REASSEMBLE dataset cover long-horizon tasks and actions which usually span multiple steps.
    • Hierarchical labels: REASSEMBLE contains action segmentation labels at two hierarchical levels.

    🔴 Dataset Collection

    Each demonstration starts by randomizing the board and object poses, after which an operator teleoperates the robot to assemble and disassemble the board while narrating their actions and marking task segment boundaries with key presses. The narrated descriptions are transcribed using Whisper [1], and the board and camera poses are measured at the beginning using a motion capture system, though continuous tracking is avoided due to interference with the event camera. Sensory data is recorded with rosbag and later post-processed into HDF5 files without downsampling or synchronization, preserving raw data and timestamps for future flexibility. To reduce memory usage, video and audio are stored as encoded MP4 and MP3 files, respectively. Transcription errors are corrected automatically or manually, and a custom visualization tool is used to validate the synchronization and correctness of all data and annotations. Missing or incorrect entries are identified and corrected, ensuring the dataset’s completeness. Low-level Skill annotations were added manually after data collection, and all labels were carefully reviewed to ensure accuracy.

    📑 Dataset Structure

    The dataset consists of several HDF5 (.h5) and JSON (.json) files, organized into two directories. The poses directory contains the JSON files, which store the poses of the cameras and the board in the world coordinate frame. The data directory contains the HDF5 files, which store the sensory readings and annotations collected as part of the REASSEMBLE dataset. Each JSON file can be matched with its corresponding HDF5 file based on their filenames, which include the timestamp when the data was recorded. For example, 2025-01-09-13-59-54_poses.json corresponds to 2025-01-09-13-59-54.h5.

    The structure of the JSON files is as follows:

    {"Hama1": [
        [x ,y, z],
        [qx, qy, qz, qw]
     ], 
     "Hama2": [
        [x ,y, z],
        [qx, qy, qz, qw]
     ], 
     "DAVIS346": [
        [x ,y, z],
        [qx, qy, qz, qw]
     ], 
     "NIST_Board1": [
        [x ,y, z],
        [qx, qy, qz, qw]
     ]
    }

    [x, y, z] represent the position of the object, and [qx, qy, qz, qw] represent its orientation as a quaternion.
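
    Reading such a pose file is straightforward; a minimal sketch (the file name is the example given above):

      import json

      with open("2025-01-09-13-59-54_poses.json") as f:
          poses = json.load(f)

      for name, (position, quaternion) in poses.items():
          print(f"{name}: position={position}, orientation (xyzw)={quaternion}")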

    The HDF5 (.h5) format organizes data into two main types of structures: datasets, which hold the actual data, and groups, which act like folders that can contain datasets or other groups. In the diagram below, groups are shown as folder icons, and datasets as file icons. The main group of the file directly contains the video, audio, and event data. To save memory, video and audio are stored as encoded byte strings, while event data is stored as arrays. The robot’s proprioceptive information is kept in the robot_state group as arrays. Because different sensors record data at different rates, the arrays vary in length (signified by the N_xxx variable in the data shapes). To align the sensory data, each sensor’s timestamps are stored separately in the timestamps group. Information about action segments is stored in the segments_info group. Each segment is saved as a subgroup, named according to its order in the demonstration, and includes a start timestamp, end timestamp, a success indicator, and a natural language description of the action. Within each segment, low-level skills are organized under a low_level subgroup, following the same structure as the high-level annotations.

    [Diagram omitted: HDF5 file layout with the groups and datasets described above.]
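
    A minimal exploration sketch with h5py (group names follow the description above; the exact names of the datasets inside each group may differ, so the file is simply walked):

      import h5py

      with h5py.File("2025-01-09-13-59-54.h5", "r") as f:
          # Print the overall layout: video/audio/event data, robot_state, timestamps, segments_info, ...
          f.visit(print)

          # Iterate over the annotated segments described above; start/end timestamps, the success
          # indicator, and the language description are stored per segment (as datasets or attributes).
          for name, segment in f["segments_info"].items():
              print(name, list(segment.keys()), dict(segment.attrs))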

    The splits folder contains two text files listing the .h5 files used for the training and validation splits.

    📌 Important Resources

    The project website contains more details about the REASSEMBLE dataset. The code for loading and visualizing the data is available in our GitHub repository.

    📄 Project website: https://tuwien-asl.github.io/REASSEMBLE_page/
    💻 Code: https://github.com/TUWIEN-ASL/REASSEMBLE

    ⚠️ File comments

    Below is a table listing the recordings with known issues. Issues typically correspond to missing data from one of the sensors.

    Recording                 Issue
    2025-01-10-15-28-50.h5    hand cam missing at beginning
    2025-01-10-16-17-40.h5    missing hand cam
    2025-01-10-17-10-38.h5    hand cam missing at beginning
    2025-01-10-17-54-09.h5    no empty action at

  17. UWMGI Image Segmentation TFRecords

    • kaggle.com
    zip
    Updated Jun 17, 2022
    Cite
    tt195361 (2022). UWMGI Image Segmentation TFRecords [Dataset]. https://www.kaggle.com/tt195361/uwmgi-image-segmentation-tfrecords
    Explore at:
    zip (1590547976 bytes)
    Dataset updated
    Jun 17, 2022
    Authors
    tt195361
    Description

    This dataset is a collection of TFRecord files to train the models for the UW-Madison GI Tract Image Segmentation competition, specifically on TPU. Each TFRecord file contains the following data:

    Name                     Type       Description
    id                       bytes      sample ID taken from the 'id' column in 'train.csv', utf-8 encoded
    case number              int64      case number taken from 'id' at caseNNN
    day number               int64      day number taken from 'id' at dayNN
    slice number             int64      slice number taken from 'id' at slice_NNNN
    image                    bytes      numpy format image bytes read from the associated file
    mask                     bytes      PNG format mask bytes generated from the 'segmentation' column in 'train.csv'
    fold                     int64      fold number that this sample belongs to
    height                   int64      slice height taken from the file name
    width                    int64      slice width taken from the file name
    space height             float32    pixel spacing height taken from the file name
    space width              float32    pixel spacing width taken from the file name
    large bowel dice coef    float32    how well the model predicted for large bowel
    small bowel dice coef    float32    how well the model predicted for small bowel
    stomach dice coef        float32    how well the model predicted for stomach
    slice count              int64      number of slices for case/day

    A sample format definition to read the record is as follows.

      # Requires TensorFlow; this feature spec is used with tf.io.parse_single_example.
      import tensorflow as tf

      TFREC_FORMAT = {
        'id': tf.io.FixedLenFeature([], tf.string),
        'case_no': tf.io.FixedLenFeature([], tf.int64),
        'day_no': tf.io.FixedLenFeature([], tf.int64),
        'slice_no': tf.io.FixedLenFeature([], tf.int64),
        'image': tf.io.FixedLenFeature([], tf.string),
        'mask': tf.io.FixedLenFeature([], tf.string),
        'fold': tf.io.FixedLenFeature([], tf.int64),
        'height': tf.io.FixedLenFeature([], tf.int64),
        'width': tf.io.FixedLenFeature([], tf.int64),
        'space_h': tf.io.FixedLenFeature([], tf.float32),
        'space_w': tf.io.FixedLenFeature([], tf.float32),
        'large_bowel_dice_coef': tf.io.FixedLenFeature([], tf.float32),
        'small_bowel_dice_coef': tf.io.FixedLenFeature([], tf.float32),
        'stomach_dice_coef': tf.io.FixedLenFeature([], tf.float32),
        'slice_count': tf.io.FixedLenFeature([], tf.int64),
      }
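
    One possible way to consume records with the format above (the file path pattern is an assumption; the mask is PNG-encoded per the table, while decoding of the raw numpy image bytes depends on how the creation notebook serialized them, so it is left out here):

      def parse_example(serialized):
          ex = tf.io.parse_single_example(serialized, TFREC_FORMAT)
          mask = tf.io.decode_png(ex['mask'], channels=3)    # PNG-encoded mask
          # ex['image'] holds raw numpy bytes; decode according to how it was written.
          return ex['image'], mask, ex['fold']

      files = tf.io.gfile.glob("uwmgi-tfrecords/*.tfrec")    # hypothetical path pattern
      ds = (tf.data.TFRecordDataset(files)
              .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
              .batch(16))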
    

    Here is the notebook used to make this dataset. Here is the notebook to train a model by using this dataset.

  18. World Population Data

    • kaggle.com
    zip
    Updated Jan 1, 2024
    + more versions
    Cite
    Sazidul Islam (2024). World Population Data [Dataset]. https://www.kaggle.com/datasets/sazidthe1/world-population-data/discussion
    Explore at:
    zip (14672 bytes)
    Dataset updated
    Jan 1, 2024
    Authors
    Sazidul Islam
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    World
    Description

    Context

    The world's population has undergone remarkable growth, exceeding 7.5 billion by mid-2019 and continuing to surge beyond previous estimates. Notably, China and India stand as the two most populous countries, with China's population potentially facing a decline while India's trajectory hints at surpassing it by 2030. This significant demographic shift is just one facet of a global landscape where countries like the United States, Indonesia, Brazil, Nigeria, and others, each with populations surpassing 100 million, play pivotal roles.

    The steady decrease in growth rates, though, is reshaping projections. While the world's population is expected to exceed 8 billion by 2030, growth will notably decelerate compared to previous decades. Specific countries like India, Nigeria, and several African nations will notably contribute to this growth, potentially doubling their populations before rates plateau.

    Content

    This dataset provides comprehensive historical population data for countries and territories globally, offering insights into various parameters such as area size, continent, population growth rates, rankings, and world population percentages. Spanning from 1970 to 2023, it includes population figures for different years, enabling a detailed examination of demographic trends and changes over time.

    Dataset

    Structured with meticulous detail, this dataset offers a wide array of information in a format conducive to analysis and exploration. Featuring parameters like population by year, country rankings, geographical details, and growth rates, it serves as a valuable resource for researchers, policymakers, and analysts. Additionally, the inclusion of growth rates and world population percentages provides a nuanced understanding of how countries contribute to global demographic shifts.

    This dataset is invaluable for those interested in understanding historical population trends, predicting future demographic patterns, and conducting in-depth analyses to inform policies across various sectors such as economics, urban planning, public health, and more.

    Structure

    This dataset (world_population_data.csv), covering 1970 through 2023, includes the following columns:

    Column Name                    Description
    Rank                           Rank by population
    CCA3                           3-digit country/territory code
    Country                        Name of the country
    Continent                      Name of the continent
    2023 Population                Population of the country in the year 2023
    2022 Population                Population of the country in the year 2022
    2020 Population                Population of the country in the year 2020
    2015 Population                Population of the country in the year 2015
    2010 Population                Population of the country in the year 2010
    2000 Population                Population of the country in the year 2000
    1990 Population                Population of the country in the year 1990
    1980 Population                Population of the country in the year 1980
    1970 Population                Population of the country in the year 1970
    Area (km²)                     Area of the country/territory in square kilometers
    Density (km²)                  Population density per square kilometer
    Growth Rate                    Population growth rate by country
    World Population Percentage    Percentage of the world population held by each country
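
    As a small usage illustration (the file name comes from the structure section above; column names are those listed in the table), each continent's share of the 2023 total can be computed with pandas:

      import pandas as pd

      df = pd.read_csv("world_population_data.csv")
      by_continent = df.groupby("Continent")["2023 Population"].sum().sort_values(ascending=False)
      print((100 * by_continent / by_continent.sum()).round(1))   # percentage of the 2023 world total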

    Acknowledgment

    The primary dataset was retrieved from the World Population Review. I sincerely thank the team for providing the core data used in this dataset.

    © Image credit: Freepik

  19. Landsat to Sentinel-2 (LS2S2), a dataset for the fusion of joint Landsat and...

    • zenodo.org
    pdf, txt, zip
    Updated Oct 29, 2025
    Cite
    Julien MICHEL (2025). Landsat to Sentinel-2 (LS2S2), a dataset for the fusion of joint Landsat and Sentinel-2 Satellite Image Time Series [Dataset]. http://doi.org/10.5281/zenodo.15471890
    Explore at:
    zip, txt, pdf
    Dataset updated
    Oct 29, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julien MICHEL
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description

    This dataset comprises joint Sentinel-2 and Landsat-8 and Landsat-9 Satellite Image Time Series (SITS) over the year 2022. It is composed of 64 Areas Of Interest (AOIs) for the training set and 41 AOIs for the testing set, each covering an area of 9.9x9.9 km² (990x990 pixels for Sentinel-2 and 330x330 pixels for Landsat). The AOIs cover Europe as well as a few spots in West Africa and the north of South America. All dates with more than 25% of clear pixels over the AOI are included in the dataset. This yields a total of 2,984 Sentinel-2 images and 1,609 Landsat-8 and -9 images in the training set. Additional statistics about the number of dates per AOI for each sensor are presented in the following table:

                    Sentinel-2                         Landsat
                    Total   Average   Min   Max        Total   Average   Min   Max
    Train split     2984    44        11    94         1581    26        0     61
    Test split      1609    39        6     98         736     18        1     56

    Important Notice: A multi-year, worldwide complement to this dataset, called LS2S2MYWW, is available here: https://doi.org/10.6096/1029 (it is too large to be hosted on Zenodo).

    For each sensor, Top-of-Canopy surface reflectance from level 2 products is used. The spectral bands included in the dataset are presented in the following table. It can be observed that the Landsat sensor does not have Red Edge bands or a wide Near Infra-Red band, and conversely the Sentinel-2 sensor does not retrieve Land Surface Temperature (LST).

    Sentinel-2                        Landsat                     Description
    Band            Resolution (m)    Band      Resolution (m)
    -               -                 B1        30                Deep blue
    B02             10                B2        30                Blue
    B03             10                B3        30                Green
    B04             10                B4        30                Red
    B05, B06, B07   20                -         -                 Red Edge
    B08             20                B5        30                Near Infra-Red
    B8a             10                -         -                 Wide Near Infra-Red
    B11, B12        20                B6, B7    30                Short Wavelength Infra-Red
    -               -                 B10       100               Land Surface Temperature

    In addition to the spectral bands, the corresponding quality masks have been used to derive a validity mask for each date of each sensor. This dataset has been gathered through the OpenEO API, in the frame of the following work:

    Julien Michel, Jordi Inglada. Temporal Attention Multi-Resolution Fusion of Satellite Image Time-Series, applied to Landsat-8 and Sentinel-2: all bands, any time, at best spatial resolution. 2025. ⟨hal-05101526⟩


    The source code associated with the paper, including the download script that created the dataset, is available here: https://github.com/Evoland-Land-Monitoring-Evolution/tamrfsits

    File organization

    Main zip files

    Two main zip files are provided: ls2s2_train.zip contains the training split, and ls2s2_test.zip contains the test split. Both zip files contain one internal zip file per AOI, organized as follows.

    Note that we provide test_31TCJ_12.zip as a sample for previewing the content of the dataset before downloading the train or test split.

    The dataset comprises one zip file per AOI. The naming pattern for the zip file is as follows: {test/train}_{mgrs_tile}_{subtile}.zip. The {test/train} field indicates whether the file is part of the training or testing set. The {mgrs_tile} field corresponds to the MGRS tile from which the AOI has been sampled. The {subtile} field indicates which sub-tile of the MGRS tile has been sampled. Sub-tiles correspond to the 1024x1024 internal JPEG2000 tiles of the Sentinel-2 product. Their numbering follows lexicographical order (columns then rows).

    Each zip file contains the following layout:

    {train/test}/{mgrs_tile}_{subtile}/
        {mgrs_tile}_{subtile}.json
        {mgrs_tile}_{subtile}_sentinel2_synopsis.png
        {mgrs_tile}_{subtile}_landsat_synopsis.png
        sentinel2/
            index.csv
            2022mmdd/
                sentinel2_mask_2022mmdd.tif
                sentinel2_bands_2022mmdd.tif
            ...
        landsat/
            index.csv
            index_pan.csv
            2022mmdd/
                landsat_mask_2022mmdd.tif
                landsat_bands_2022mmdd.tif
                landsat_pan_mask_2022mmdd.tif
                landsat_pan_2022mmdd.tif
            ...

    Files description

    Here is a description of the different files:


    • {mgrs_tile}_{subtile}.json: A JSON file describing the AOI.
    • {mgrs_tile}_{subtile}_sentinel2_synopsis.png: A synopsis PNG file allowing to see all Sentinel-2 images and masks of the AOI at a glance.
    • {mgrs_tile}_{subtile}_landsat_synopsis.png: A synopsis PNG file allowing to see all Landsat images and masks of the AOI at a glance.
    • index.csv: The csv file indexing the Sentinel-2 or Landsat data for the AOI.
    • index_pan.csv: The csv file indexing the Landsat panchromatic data for the AOI.
    • sentinel2_mask_2022mmdd.tif: 990x990 pixel GeoTIFF file containing the validity mask for the current date (0 for valid and 1 for invalid).
    • sentinel2_bands_2022mmdd.tif: 990x990 pixel GeoTIFF file containing the Sentinel-2 spectral bands in surface reflectance * 10 000. Band order is B2, B3, B4, B5, B6, B7, B8, B8A, B11, B12. 20m bands are up-sampled to 10m resolution by means of bicubic interpolation. No-data pixels have the value -10 000.
    • landsat_mask_2022mmdd.tif: 330x330 pixel GeoTIFF file containing the validity mask for the current date (0 for valid and 1 for invalid). Spatial resolution is 30m.
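
    An illustrative way to read one Sentinel-2 date (rasterio is an assumption, not part of the dataset, and the date in the paths is hypothetical; file names follow the layout above) is to scale the reflectances by 1/10 000 and blank out invalid pixels using the validity mask:

      import numpy as np
      import rasterio

      # Hypothetical date folder 20220614; adjust to an existing acquisition.
      with rasterio.open("sentinel2/20220614/sentinel2_bands_20220614.tif") as src:
          bands = src.read().astype(np.float32) / 10000.0    # 10 bands, 990x990, B2..B12
      with rasterio.open("sentinel2/20220614/sentinel2_mask_20220614.tif") as src:
          invalid = src.read(1) == 1                         # 0 = valid, 1 = invalid

      bands[:, invalid] = np.nan
      print(np.nanmean(bands, axis=(1, 2)))                  # mean reflectance per band
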

Cite
Loist, Skadi; Samoilova, Evgenia (Zhenya) (2024). Film Circulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7887671

Film Circulation dataset

Explore at:
Dataset updated
Jul 12, 2024
Dataset provided by
Film University Babelsberg KONRAD WOLF
Authors
Loist, Skadi; Samoilova, Evgenia (Zhenya)
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

Please cite this when using the dataset.

Detailed description of the dataset:

1 Film Dataset: Festival Programs

The Film Dataset consists a data scheme image file, a codebook and two dataset tables in csv format.

The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.

2 Survey Dataset

The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

3 IMDb & Scripts

The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”, where the cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.

The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

The script “r_5a_extracting_info_sample” uses the function defined in the “r_4_scraping_functions”, in order to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check, if everything works. Scraping for the entire dataset took a few hours. Therefore, a test with a subsample of 100 films is advisable.

The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tried to extract data one more time to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.

4 Festival Library Dataset

The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories, units of measurement, data sources and coding and missing data.

The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, all information for each festival is listed in one row. This
