19 datasets found
  1. Film Circulation dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Cite
    Loist, Skadi; Samoilova, Evgenia (Zhenya) (2024). Film Circulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7887671
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Film University Babelsberg KONRAD WOLF
    Authors
    Loist, Skadi; Samoilova, Evgenia (Zhenya)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS – European Journal of Media Studies, an open-access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.

    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook, and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information on whether festival run information is available through the IMDb data.
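
    As a quick orientation, here is a minimal pandas sketch of how the long and wide tables relate. It is not part of the published dataset; the column names film_id, fest and year are assumptions based on the description above, and the actual file names may differ slightly.

        import pandas as pd

        film_long = pd.read_csv("1_film-dataset_festival-program_long.csv")
        film_wide = pd.read_csv("1_film-dataset_festival-program_wide.csv")

        # The long table may repeat a film (one row per sample festival); the wide
        # table lists each unique film once, so these two counts should agree.
        print(film_long["film_id"].nunique(), len(film_wide))

        # First sample festival per film, mirroring how the wide table's fest
        # variable is described above (column names are assumptions).
        first_fest = (film_long.sort_values("year")
                               .groupby("film_id")["fest"]
                               .first())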

    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is in the wide format, i.e. the information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on the number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to a crew member of a given film. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes eight text files containing the scripts for web scraping. They were written using R 3.6.3 for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, falling back to an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with the suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach based on two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may have typos or minor variations.
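
    For readers unfamiliar with these two measures, the following self-contained Python sketch illustrates them; it is not the authors' R code, and the bigram-based cosine similarity and the thresholds one would pick are assumptions for illustration only.

        from collections import Counter
        import math

        def bigrams(s):
            s = s.lower()
            return Counter(s[i:i + 2] for i in range(len(s) - 1))

        def cosine_similarity(a, b):
            # Cosine similarity over character-bigram counts.
            va, vb = bigrams(a), bigrams(b)
            dot = sum(va[g] * vb[g] for g in va)
            norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
            return dot / norm if norm else 0.0

        def osa_distance(a, b):
            # Optimal string alignment: edit distance plus adjacent transpositions.
            a, b = a.lower(), b.lower()
            d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
            for i in range(len(a) + 1):
                d[i][0] = i
            for j in range(len(b) + 1):
                d[0][j] = j
            for i in range(1, len(a) + 1):
                for j in range(1, len(b) + 1):
                    cost = 0 if a[i - 1] == b[j - 1] else 1
                    d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
                    if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                        d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
            return d[len(a)][len(b)]

        # A candidate IMDb title vs. a core-dataset title with a small typo.
        print(cosine_similarity("The Grand Budapest Hotel", "Grand Budapest Hotel"))
        print(osa_distance("Paterson", "Patterson"))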

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the number of missing values and errors.

    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definitions of variables (such as location, festival name, and festival categories), units of measurement, data sources, coding, and information on missing data.

    The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset is in wide format, i.e. all information for each festival is listed in one row. This

  2. Louisville Metro KY - Open Data Data Set Inventory Updated for 2022

    • datasets.ai
    • data.louisvilleky.gov
    • +4more
    Updated Apr 13, 2023
    + more versions
    Cite
    Louisville Metro Government (2023). Louisville Metro KY - Open Data Data Set Inventory Updated for 2022 [Dataset]. https://datasets.ai/datasets/louisville-metro-ky-open-data-data-set-inventory-updated-for-2022-286a7
    Explore at:
    Available download formats
    Dataset updated
    Apr 13, 2023
    Dataset authored and provided by
    Louisville Metro Government
    Area covered
    Louisville, Kentucky
    Description

    This data aligns with WWC Certification requirements, and serves as the basis for our data warehouse and open data roadmap. It's a continual work in progress across all departments.

    Louisville Metro Technology Services builds data and technology platforms to ready our government for our community’s digital future.

    Data Dictionary (Field Name: Description):

    Dataset Name: The official title of the dataset as listed in the inventory.
    Brief Description of Data: A short summary explaining the contents and purpose of the dataset.
    Data Source: The origin or system from which the data is collected or generated.
    Home Department: The primary department responsible for the dataset.
    Home Department Division: The specific division within the department that manages the dataset.
    Data Steward (Business) Name: The name of the person responsible for the dataset’s accuracy and relevance.
    Data Custodian (Technical) Name: The technical contact responsible for maintaining and managing the dataset infrastructure.
    Data Classification: The sensitivity level of the data (e.g., Public, Internal, Confidential).
    Data Format: The file format(s) in which the dataset is available (e.g., CSV, JSON, Shapefile).
    Frequency of Data Change: How often the dataset is updated (e.g., Daily, Weekly, Monthly, Annually).
    Time Span: The overall time period the dataset covers.
    Start Date: The beginning date of the data.

  3. Global scientific academies Dataset

    • scidb.cn
    Updated Nov 18, 2024
    Cite
    chen xiaoli (2024). Global scientific academies Dataset [Dataset]. http://doi.org/10.57760/sciencedb.14674
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Nov 18, 2024
    Dataset provided by
    Science Data Bank
    Authors
    chen xiaoli
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset was generated as part of a study aimed at profiling global scientific academies, which play a significant role in promoting scholarly communication and scientific progress. Below is a detailed description of the dataset.

    Data Generation Procedures and Tools: The dataset was compiled using a combination of web scraping, manual verification, and data integration from multiple sources, including Wikipedia categories, members of unions of scientific organizations, and web searches using specific query phrases (e.g., "country name + (academy OR society) AND site:.country code"). The records were enriched by cross-referencing data from the Wikidata API, the VIAF API, and the Research Organisation Registry (ROR). Additional manual curation ensured accuracy and consistency.

    Temporal and Geographical Scopes: The dataset covers scientific academies over a wide temporal scope, ranging from the 15th century to the present. The geographical scope includes academies from all continents, with emphasis on both developed and developing countries. The dataset aims to capture the full spectrum of scientific academies across different periods of historical development.

    Tabular Data Description: The dataset comprises a total of 301 academy records and 14,008 website navigation sections. Each row in the dataset represents a single scientific academy, while the columns describe attributes such as the academy’s name, founding date, location (city and country), website URL, email, and address.

    Missing Data: Although the dataset offers comprehensive coverage, some entries may have missing or incomplete fields. For instance, website navigation sections were not available for all records.

    Data Errors and Error Ranges: The data has been verified through manual curation, reducing the likelihood of errors. However, the use of crowd-sourced data from platforms like Wikipedia introduces potential risks of outdated or incomplete information. Any errors are likely minor and confined to fields such as navigation menu classifications, which may not fully reflect the breadth of an academy's activities.

    Data Files, Formats, and Sizes: The dataset is provided in CSV and JSON formats, ensuring compatibility with a wide range of software applications, including Microsoft Excel, Google Sheets, and programming languages such as Python (via libraries like pandas).

    This dataset provides a valuable resource for further research into the organizational behaviors, geographic distribution, and historical significance of scientific academies across the globe. It can be used for large-scale analyses, including comparative studies across different regions or time periods. Any feedback on the data is welcome; please contact the maintainer of the dataset.

    If you use the data, please cite the following paper: Xiaoli Chen and Xuezhao Wang. 2024. Profiling Global Scientific Academies. In The 2024 ACM/IEEE Joint Conference on Digital Libraries (JCDL ’24), December 16–20, 2024, Hong Kong, China. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3677389.3702582
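
    As a minimal illustration of loading the CSV release with pandas (as suggested above), here is a short sketch; the file name and the column names "country" and "founding_date" are assumptions, not confirmed field names from the record.

        import pandas as pd

        # The actual file name in the Science Data Bank record may differ.
        academies = pd.read_csv("global_scientific_academies.csv")

        # Quick look at coverage by country and founding date (assumed columns).
        print(academies.shape)
        print(academies["country"].value_counts().head(10))
        print(academies["founding_date"].min(), academies["founding_date"].max())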

  4. UNI-CEN Standardized Census Data Table - Census Division (CD) - 1986 - Long...

    • borealisdata.ca
    • dataone.org
    Updated Apr 4, 2023
    + more versions
    Cite
    UNI-CEN Project (2023). UNI-CEN Standardized Census Data Table - Census Division (CD) - 1986 - Long Format (DTA) (Version 2023-03) [Dataset]. http://doi.org/10.5683/SP3/QVOT0Y
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 4, 2023
    Dataset provided by
    Borealis
    Authors
    UNI-CEN Project
    License

    https://borealisdata.ca/api/datasets/:persistentId/versions/4.0/customlicense?persistentId=doi:10.5683/SP3/QVOT0Y

    Time period covered
    Jan 1, 1986
    Area covered
    Canada
    Description

    UNI-CEN Standardized Census Data Tables contain Census data that have been reformatted into a common table format with standardized variable names and codes. The data are provided in two tabular formats for different use cases. "Long" tables are suitable for use in statistical environments, while "wide" tables are commonly used in GIS environments. The long tables are provided in Stata Binary (dta) format, which is readable by all statistics software. The wide tables are provided in comma-separated values (csv) and dBase 3 (dbf) formats with codebooks. The wide tables are easily joined to the UNI-CEN Digital Boundary Files. For the csv files, a .csvt file is provided to ensure that column data formats are correctly formatted when importing into QGIS. A schema.ini file does the same when importing into ArcGIS environments. As the DBF file format supports a maximum of 250 columns, tables with a larger number of variables are divided into multiple DBF files. For more information about file sources, the methods used to create them, and how to use them, consult the documentation at https://borealisdata.ca/dataverse/unicen_docs. For more information about the project, visit https://observatory.uwo.ca/unicen.
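
    As a minimal sketch of using the two table formats in Python, the following assumes placeholder file names (the actual names in the record differ); pandas reads both the Stata long tables and the csv wide tables directly.

        import pandas as pd

        # "Long" table in Stata Binary (dta) format, suitable for statistical work.
        long_table = pd.read_stata("unicen_cd_1986_long.dta")  # hypothetical file name
        print(long_table.head())

        # "Wide" table in csv format; in a GIS this would be joined to the
        # UNI-CEN Digital Boundary Files, but it loads the same way here.
        wide_table = pd.read_csv("unicen_cd_1986_wide.csv")  # hypothetical file name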

  5. Japanese Chain of Thought Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Japanese Chain of Thought Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/japanese-chain-of-thought-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Welcome to the Japanese Chain of Thought prompt-response dataset, a meticulously curated collection containing 3000 comprehensive prompt and response pairs. This dataset is an invaluable resource for training Language Models (LMs) to generate well-reasoned answers and minimize inaccuracies. Its primary utility lies in enhancing LLMs' reasoning skills for solving arithmetic, common sense, symbolic reasoning, and complex problems.

    Dataset Content

    This COT dataset comprises a diverse set of instructions and questions paired with corresponding answers and rationales in the Japanese language. These prompts and completions cover a broad range of topics and questions, including mathematical concepts, common sense reasoning, complex problem-solving, scientific inquiries, puzzles, and more.

    Each prompt is meticulously accompanied by a response and rationale, providing essential information and insights to enhance the language model training process. These prompts, completions, and rationales were manually curated by native Japanese people, drawing references from various sources, including open-source datasets, news articles, websites, and other reliable references.

    Our chain-of-thought prompt-completion dataset includes various prompt types, such as instructional prompts, continuations, and in-context learning (zero-shot, few-shot) prompts. Additionally, the dataset contains prompts and completions enriched with various forms of rich text, such as lists, tables, code snippets, JSON, and more, with proper markdown format.

    Prompt Diversity

    To ensure a wide-ranging dataset, we have included prompts from a plethora of topics related to mathematics, common sense reasoning, and symbolic reasoning. These topics encompass arithmetic, percentages, ratios, geometry, analogies, spatial reasoning, temporal reasoning, logic puzzles, patterns, and sequences, among others.

    These prompts vary in complexity, spanning easy, medium, and hard levels. Various question types are included, such as multiple-choice, direct queries, and true/false assessments.

    Response Formats

    To accommodate diverse learning experiences, our dataset incorporates different types of answers depending on the prompt and provides step-by-step rationales. The detailed rationale aids the language model in building a reasoning process for complex questions.

    These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details

    This fully labeled Japanese Chain of Thought Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt complexity, prompt category, domain, response, rationale, response type, and rich text presence.
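
    A minimal sketch of inspecting the JSON release is shown below. The file name and the field keys (e.g. "prompt_type", "prompt_complexity") are assumptions based on the annotation details listed above, not confirmed keys from the delivered files.

        import json
        from collections import Counter

        with open("japanese_cot_dataset.json", encoding="utf-8") as f:  # hypothetical file name
            records = json.load(f)

        # Distribution of prompt types and complexity levels (assumed keys).
        print(Counter(r["prompt_type"] for r in records))
        print(Counter(r["prompt_complexity"] for r in records))

        # One full example: prompt, response, and step-by-step rationale.
        sample = records[0]
        print(sample["prompt"], sample["response"], sample["rationale"], sep="\n---\n")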

    Quality and Accuracy

    Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses and rationales are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.

    The Japanese version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

    Continuous Updates and Customization

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom chain of thought prompt completion data tailored to specific needs, providing flexibility and customization options.

    License

    The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Japanese Chain of Thought Prompt Completion Dataset to enhance the rationale and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.

  6. Dog Breeds

    • kaggle.com
    zip
    Updated Sep 22, 2023
    Cite
    Sujay Kapadnis (2023). Dog Breeds [Dataset]. https://www.kaggle.com/datasets/sujaykapadnis/dog-breeds/code
    Explore at:
    Available download formats: zip (27212 bytes)
    Dataset updated
    Sep 22, 2023
    Authors
    Sujay Kapadnis
    Description

    The data comes from the American Kennel Club. It includes: breed_traits (trait information on each dog breed and scores for each trait, wide format), trait_description (long descriptions of each trait and the values corresponding to Trait_Score), and breed_rank_all (popularity of dog breeds by AKC registration statistics from 2013-2020).

    Data Dictionary

    breed_traits_long.csv

    variable | class | description
    Breed | character | Dog Breed
    Trait | character | Name of trait/characteristic
    Trait_Score | character | Placement on scale of 1-5 for the trait, with the exception of a description for coat type and length

    breed_traits.csv

    variable | class | description
    Breed | character | Dog Breed
    Affectionate With Family | character | Placement on scale of 1-5 for the breed's tendency to be "Affectionate With Family" (Trait_Score)
    Good With Young Children | character | Placement on scale of 1-5 for the breed's tendency to be "Good With Young Children" (Trait_Score)
    Good With Other Dogs | character | Placement on scale of 1-5 for the breed's tendency to be "Good With Other Dogs" (Trait_Score)
    Shedding Level | character | Placement on scale of 1-5 for the breed's "Shedding Level" (Trait_Score)
    Coat Grooming Frequency | character | Placement on scale of 1-5 for the breed's "Coat Grooming Frequency" (Trait_Score)
    Drooling Level | character | Placement on scale of 1-5 for the breed's "Drooling Level" (Trait_Score)
    Coat Type | character | Description of the breed's coat type (Trait_Score)
    Coat Length | character | Description of the breed's coat length (Trait_Score)
    Openness To Strangers | character | Placement on scale of 1-5 for the breed's tendency to be open to strangers (Trait_Score)
    Playfulness Level | character | Placement on scale of 1-5 for the breed's tendency to be playful (Trait_Score)
    Watchdog/Protective Nature | character | Placement on scale of 1-5 for the breed's "Watchdog/Protective Nature" (Trait_Score)
    Adaptability Level | character | Placement on scale of 1-5 for the breed's tendency to be adaptable (Trait_Score)
    Trainability Level | character | Placement on scale of 1-5 for the breed's tendency to be trainable (Trait_Score)
    Energy Level | character | Placement on scale of 1-5 for the breed's "Energy Level" (Trait_Score)
    Barking Level | character | Placement on scale of 1-5 for the breed's "Barking Level" (Trait_Score)
    Mental Stimulation Needs | character | Placement on scale of 1-5 for the breed's "Mental Stimulation Needs" (Trait_Score)

    trait_description.csv

    variable | class | description
    Trait | character | Name of trait
    Trait_1 | character | Value corresponding to Trait when Trait_Score = 1
    Trait_5 | character | Value corresponding to Trait when Trait_Score = 5
    Description | character | Long description of trait

    breed_rank_all.csv

    variable | class | description
    Breed | character | Dog Breed
    2013 Rank | character | Popularity of breed based on AKC registration statistics in 2013
    2014 Rank | character | Popularity of breed based on AKC registration statistics in 2014
    2015 Rank | character | Popularity of breed based on AKC registration statistics in 2015
    2016 Rank | character | Popularity of breed based on AKC registration statistics in 2016
    2017 Rank | character | Popularity of breed based on AKC registration statistics in 2017
    2018 Rank | character | Popularity of breed based on AKC registration statistics in 2018
    2019 Rank | character | Popularity of breed based on AKC registration statistics in 2019
    2020 Rank | character | Popularity of breed based on AKC registration statistics in 2020
    links | character | Link to the dog breed's AKC webpage
    Image | character | Link to image of dog breed
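
    As an illustration of how these tables fit together, here is a minimal pandas sketch joining the trait scores to the 2020 popularity ranks; it assumes the csv files are named exactly as in the data dictionary above.

        import pandas as pd

        traits = pd.read_csv("breed_traits.csv")
        ranks = pd.read_csv("breed_rank_all.csv")

        # Join on the shared Breed column and compare one trait against 2020 popularity.
        merged = traits.merge(ranks[["Breed", "2020 Rank"]], on="Breed", how="inner")
        merged["2020 Rank"] = pd.to_numeric(merged["2020 Rank"], errors="coerce")
        print(merged[["Breed", "Affectionate With Family", "2020 Rank"]]
              .sort_values("2020 Rank")
              .head(10))
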
  7. AI in Business – EU contextual regulations & future workforce survey data...

    • zenodo.org
    Updated Sep 1, 2025
    Cite
    Antonio Clim; Antonio Clim (2025). AI in Business – EU contextual regulations & future workforce survey data (dataset) [Dataset]. http://doi.org/10.5281/zenodo.17022049
    Explore at:
    Dataset updated
    Sep 1, 2025
    Dataset provided by
    Zenodo
    Authors
    Antonio Clim; Antonio Clim
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Nov 15, 2023
    Description
    DATASET OVERVIEW

    This dataset is a CSV file containing responses from an online survey titled “AI in Business – EU Contextual Regulations and Future Workforce Impacts.” The survey ran from late 2023 through 2025, gathering over 300 responses from participants (primarily in Romania and other EU regions). Each row in the CSV represents one respondent’s answers. Questions span multiple-choice and Likert scale responses, divided into three sections: (A) demographics and background, (B) perceptions of AI in business and relevant EU regulations, and (C) motivations and experiences with AI adoption in business. The questionnaire was provided in both English and Romanian to maximize clarity and participation in a bilingual context[1]. Table 1 below summarizes the structure of the dataset, listing each field (column) and its contents or question topic.

    Table 1: Survey Dataset Structure

    Timestamp: Date and time when the response was submitted.
    Email address: Email address of the respondent.
    Age group: Respondent's age group (e.g., 18–24, 25–34, etc.).
    Gender: Gender of the respondent (Female, Male, Prefer not to say).
    Highest level of education: Highest educational attainment of the respondent (High school or less; Bachelor’s/Master’s; Doctorate or above; Vocational; Prefer not to say).
    Current job role: Current job role or position of the respondent (Leadership/management; Professional/technical; Administrative/support; Sales/marketing; Student; Retired; Unemployed).
    Sector: Industry sector in which the respondent currently works (e.g., Technology, Healthcare, Finance, Education, Manufacturing, Public, Retail, Other).
    Years of experience: Number of years the respondent has been working in their current field.
    AI knowledge: Self-rated knowledge and understanding of AI technology (No knowledge; Beginner; Intermediate; Advanced; Expert).
    Region: Region (in Romania or EU) where the respondent currently resides.
    AI familiarity: Familiarity with the concept of AI in a business context (1 = Not very familiar, 10 = Very familiar).
    AI improves operations: Belief that AI can significantly improve business operations (1 = Strongly disagree, 10 = Strongly agree).
    EU AI regulations too restrictive: Opinion on whether current EU regulations on AI in business are too restrictive (1 = Strongly disagree, 10 = Strongly agree).
    AI impact on job creation: Perceived impact of AI on job creation in the EU (1 = Not important, 10 = Significantly positive).
    AI job losses: Expectation that AI will lead to job losses in the EU (1 = Definitely will not, 10 = Definitely will).
    EU ethical AI leadership: Belief that the EU is well-positioned to lead globally in ethical AI practices (1 = Strongly disagree, 10 = Strongly agree).
    Data privacy regs effective: Perception of how effectively EU regulations address data privacy concerns in AI (1 = Very ineffectively, 10 = Very effectively).
    AI in decision-making stance: Stance on AI’s role in business decision-making processes (1 = Fully oppose, 10 = Fully support).
    Invest in AI education: Support for investing more in AI education and workforce training (1 = Definitely no, 10 = Definitely yes).
    AI creates new jobs: Belief that AI can lead to new types of jobs not existing today (1 = Strongly do not believe, 10 = Strongly believe).
    Transparency of AI processes: Perception of how transparent AI-driven processes are in businesses (1 = Completely opaque, 10 = Completely transparent).
    Confidence in AI data handling: Confidence in AI’s ability to ethically handle sensitive data (1 = Not confident at all, 10 = Very confident).
    AI widens SME gap: Belief that AI technologies will widen the gap between large corporations and SMEs (1 = Strongly disagree, 10 = Strongly agree).
    Future AI outlook (EU): Outlook on the future of AI in EU business over the next decade (1 = Extremely pessimistic, 5 = Extremely optimistic).
    AI integration cost perception: Perceived cost of integrating AI into operations for SMEs (1 = Very affordable, 10 = Extremely costly).
    Primary motivation for AI adoption: Primary motivation for adopting AI in the business (e.g., Improving efficiency; Customer experience; Competitive advantage; Cost reduction).
    AI helps navigate uncertainties: Belief that AI can help the business navigate market uncertainties more effectively (1 = Strongly disagree, 10 = Strongly agree).
    Awareness of EU AI regulations: Self-assessed understanding and awareness of EU AI regulations and their impact (1 = Not aware, 10 = Fully aware).
    Challenges in data access for AI: Whether the business has faced challenges in accessing or analyzing data for AI applications (Frequently; Occasionally; Rarely; Never; Not applicable).
    AI critical for SME future: Perceived criticality of AI in shaping future strategies and business models for SMEs (Extremely critical; Important; Somewhat important; Unimportant; Unsure).
    Plan to invest in AI (1-2 yrs): Plan to invest in AI technologies or tools in the next 1–2 years (Yes, definitely; Considering; Unsure; Unlikely; No plans).
    Support needed for AI adoption: Type of support or resources perceived to most help SMEs adopt AI effectively (Financial grants; Training resources; Government policies; Technical assistance).
    AI impact on job roles: Perceived impact of AI on employee roles in the business (Create new roles; Augment existing roles; Replace certain roles; No significant impact; Unsure).
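
    A minimal sketch of summarizing the 1–10 Likert items in the CSV is shown below. The file name and exact column headers are assumptions; the real CSV most likely uses the full question wording as headers.

        import pandas as pd

        survey = pd.read_csv("ai_in_business_survey.csv")  # hypothetical file name

        # Treat every numeric column as a Likert-type item and report its mean score.
        likert = survey.select_dtypes("number")
        print(likert.mean().sort_values(ascending=False))

        # Example cross-tab: AI knowledge level vs. plan to invest (assumed column names).
        print(pd.crosstab(survey["AI knowledge"], survey["Plan to invest in AI (1-2 yrs)"]))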

  8. Unified database of ozonesounding profiles

    • zenodo.org
    zip
    Updated Sep 12, 2025
    Cite
    Fabrizio Marra; Fabrizio Marra; FABIO MADONNA; FABIO MADONNA; Emanuele Tramutola; Emanuele Tramutola (2025). Unified database of ozonesounding profiles [Dataset]. http://doi.org/10.5281/zenodo.17094116
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 12, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Fabrizio Marra; Fabrizio Marra; FABIO MADONNA; FABIO MADONNA; Emanuele Tramutola; Emanuele Tramutola
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2025
    Description

    The unified database of ozonesounding profiles was obtained through the merging of three existing ozonesounding datasets, provided by the Southern Hemisphere Additional OZonesondes (SHADOZ), the Network for the Detection of Atmospheric Composition Change (NDACC), and the World Ozone and Ultraviolet Radiation Data Centre (WOUDC).

    Only a selected set of variables of interest, both data and metadata, was considered when building the unified dataset, due to the heterogeneous formats and varying levels of detail provided by each network, even when referring to measurements shared across different initiatives. These variables are listed in the following table.

    Standard name | Description | Unit
    idstation | The name of the station. | N.A.
    location_latitude | Latitude of station. | deg
    location_longitude | Longitude of station. | deg
    location_height | Height is defined as the altitude, elevation, or height of the defined platform + instrument above sea level. | m
    date_of_observation | Date when the ozonesonde was launched (in format yyyy-mm-dd hh:mm:ss with time zone). | N.A.
    time | Elapsed flight time since release. | s
    pressure | Atmospheric pressure of each level in Pascals. | Pa
    geop_alt | Geopotential height in meters. | m
    temperature | Air temperature in kelvin. | K
    relative_humidity | Relative humidity (dimensionless). | 1
    wind_speed | Wind speed in meters per second. | m/s
    wind_direction | Wind direction in degrees. | deg
    latitude | Observation latitude (during the flight). | deg
    longitude | Observation longitude (during the flight). | deg
    altitude | Height of sensor above local ground or sea surface. Positive values for above surface (e.g., sondes), negative for below (e.g., XBT). For visual observations, the height of the visual observing platform. | m (a.s.l.)
    sample_temperature | Temperature where the sample is measured, in kelvin. | K
    o3_partial_pressure | The level partial pressure of ozone in Pascals. | Pa
    ozone_concentration | The level mixing ratio of ozone in ppmv. | ppmv
    ozone_partial_pressure_total_uncertainty | Total uncertainty in the calculation of the ozone partial pressure as a composite of the individual uncertainty contributions. Uncertainties due to systematic bias are assumed to be random and to follow a normal distribution. The uncertainty calculation also accounts for the increased uncertainty incurred by homogenizing the data record. | Pa
    network | Source network of the profile. | N.A.
    type | Station classification flag. | N.A.
    filter_check | Profile quality control flag. | N.A.

    The dataset is organized into two main tables:

    • unified_header, which contains metadata associated with each ozonesounding profile (idstation, date_of_observation, location_latitude, location_longitude, location_height, network, type, filter_check);
    • unified_value, which includes the actual measurement data (idstation, date_of_observation, time, pressure, geop_alt, temperature, relative_humidity, wind_speed, wind_direction, latitude, longitude, altitude, sample_temperature, o3_partial_pressure, ozone_concentration, ozone_partial_pressure_total_uncertainty).

    To improve accessibility and performance, both tables are further subdivided into year-specific subtables, allowing for more efficient querying and data management across temporal ranges.

    Among the metadata variables included in the unified_header table, type and filter_check play a key role in characterizing the quality and coverage of the ozonesounding profiles. The type variable classifies each station based on the continuity of its time series: stations are grouped into Long Coverage (G), Medium Coverage (Y), or Short Coverage (R), depending on whether they provide at least one profile per month for at least 95% of the months in their time series, spanning:

    • ≥20 years for Long Coverage,
    • ≥10 and <20 years for Medium Coverage,
    • <10 years for Short Coverage.

    The filter_check variable is a quality control flag ranging from 0 to 3, summarizing the results of three structural checks applied to each profile: completeness of monthly coverage (at least three ascents per month), vertical coverage (reaching at least 10 hPa), and vertical resolution (minimum one data point every 100 meters). A higher filter_check value indicates better compliance with these criteria and, consequently, higher data reliability.
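
    A minimal sketch of selecting high-quality, long-coverage profiles from the unified_header table is given below. The file name and on-disk layout (one CSV per year-specific subtable) are assumptions; the column names follow the metadata variables listed above.

        import pandas as pd

        header = pd.read_csv("unified_header_2010.csv")  # hypothetical year-specific subtable

        # Long Coverage stations (type "G") whose profiles pass all three structural checks.
        good = header[(header["type"] == "G") & (header["filter_check"] == 3)]
        print(f"{len(good)} long-coverage profiles passing all three structural checks")

        # The matching measurement rows can then be pulled from the corresponding
        # unified_value subtable via idstation and date_of_observation.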

    Furthermore, an algorithm was implemented that is able to merge the different datasets by handling their different

  9. Data from: cigChannel: A large-scale 3D seismic dataset with labeled...

    • zenodo.org
    zip
    Updated Aug 1, 2025
    Cite
    Guangyu Wang; Xinming Wu; Wen Zhang; Guangyu Wang; Xinming Wu; Wen Zhang (2025). cigChannel: A large-scale 3D seismic dataset with labeled paleochannels for advancing deep learning in seismic interpretation [Dataset]. http://doi.org/10.5281/zenodo.11044512
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 1, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Guangyu Wang; Xinming Wu; Wen Zhang; Guangyu Wang; Xinming Wu; Wen Zhang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Assorted channel subset of the cigChannel dataset (V1.0)

    cigChannel (V1.0) is a dataset created by the Computational Interpretation Group (CIG) for deep-learning-based paleochannel interpretation in 3D seismic volumes. Guangyu Wang, Xinming Wu and Wen Zhang are the main contributors to the dataset.

    cigChannel (V1.0) contains 1,600 synthetic 3D seismic volumes with labels of meandering channels, tributary channel networks and submarine canyons. Seismic impedance and sedimentary facies volumes (the latter only for submarine canyons) that correspond to the seismic volumes are also included in this dataset. The components of this dataset are listed below:

    Subset name: Meandering channel
    Sample amount & size: 400, 256x256x256
    Contents: seismic volumes (float32); binary-class label volumes (uint8); seismic impedance volume (float32)
    Features: meandering channels; horizontal, inclined, folded and faulted structures; noise-free

    Subset name: Tributary channel network (formerly distributary channel)
    Sample amount & size: 400, 256x256x256
    Contents: seismic volumes (float32); binary-class label volumes (uint8); seismic impedance volume (float32)
    Features: tributary channel networks; horizontal, inclined, folded and faulted structures; noise-free

    Subset name: Submarine canyon (formerly submarine channel)
    Sample amount & size: 400, 256x256x256
    Contents: seismic volumes (float32); binary-class label volumes (uint8); seismic impedance volumes (float32); sedimentary facies volumes (int16)
    Features: submarine canyons; horizontal, inclined, folded and faulted structures; noise-free

    Subset name: Assorted channel
    Sample amount & size: 400, 256x256x256
    Contents: seismic volumes (float32); multi-class label volumes (int16); seismic impedance volumes (float32)
    Features: meandering channels, tributary channel networks and submarine canyons; horizontal, inclined, folded and faulted structures; noise-free

    Further details about this dataset are available in our paper published in Earth System Science Data:

    Wang, G., Wu, X., and Zhang, W.: cigChannel: a large-scale 3D seismic dataset with labeled paleochannels for advancing deep learning in seismic interpretation, Earth Syst. Sci. Data, 17, 3447–3471, https://doi.org/10.5194/essd-17-3447-2025, 2025.

    Due to the size limitation of the uploaded files, we have to publish the dataset in separate versions. This version includes the assorted channel subset, which contains the following zip files:

    • Assorted_Channel_Ip_xx-xx.zip: Seismic impedance volumes of sample No.xx to No.xx.
    • Assorted_Channel_Label_xx-xx.zip: Multi-class label volumes of sample No.xx to No.xx, where the value 0 represents the background (non-channel areas), 101 represents meandering channel No.1, 102 represents meandering channel No.2, etc., 201 represents distributary channel network No.1, 202 represents distributary channel network No.2, etc., and 301 represents the submarine channel (only one submarine channel per volume).
    • Assorted_Channel_Seismic_xx-xx.zip: Seismic (amplitude) volumes of sample No.xx to No.xx.
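
    A minimal numpy sketch of decoding the multi-class label encoding described above into channel types follows. It assumes the label volumes load as 256x256x256 int16 arrays; the file name, the loading step, and the upper bounds of the value ranges are placeholders, not part of the published dataset.

        import numpy as np

        label = np.load("assorted_channel_label_000.npy")  # hypothetical file name and format

        background = label == 0
        meandering = (label >= 101) & (label <= 199)   # 101 = meandering channel No.1, 102 = No.2, ...
        tributary = (label >= 201) & (label <= 299)    # 201 = distributary (tributary) network No.1, ...
        submarine = label == 301                       # only one submarine channel per volume

        print("voxels per class:",
              background.sum(), meandering.sum(), tributary.sum(), submarine.sum())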

    Samples in this subset feature different geologic structures:

    • Sample No. 0 to No. 49 feature horizontal structure.
    • Sample No. 50 to No. 99 feature inclined structure.
    • Sample No. 100 to No. 299 feature folded structure.
    • Sample No. 300 to No. 399 feature folded and faulted structures (uploaded as an expansion package).

    Portal to the expansion package: https://doi.org/10.5281/zenodo.15500696

    Portals to the other subsets:

  10. Harmonized global datasets of soil carbon and heterotrophic respiration from...

    • zenodo.org
    bin, nc, txt
    Updated Oct 7, 2025
    Cite
    Shoji Hashimoto; Shoji Hashimoto; Akihiko Ito; Akihiko Ito; Kazuya Nishina; Kazuya Nishina (2025). Harmonized global datasets of soil carbon and heterotrophic respiration from data-driven estimates, with derived turnover time and Q10 [Dataset]. http://doi.org/10.5281/zenodo.17282577
    Explore at:
    Available download formats: nc, txt, bin
    Dataset updated
    Oct 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shoji Hashimoto; Shoji Hashimoto; Akihiko Ito; Akihiko Ito; Kazuya Nishina; Kazuya Nishina
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We collected all available global soil carbon (C) and heterotrophic respiration (RH) maps derived from data-driven estimates, sourcing them from public repositories and supplementary materials of previous studies (Table 1). All spatial datasets were converted to NetCDF format for consistency and ease of use.

    Because the maps had varying spatial resolutions (ranging from 0.0083° to 0.5°), we harmonized all datasets to a common resolution of 0.5° (approximately 50 km at the equator). We then merged the processed maps by computing the mean, maximum, and minimum values at each grid cell, resulting in harmonized global maps of soil C (for the top 0–30 cm and 0–100 cm depths) and RH at 0.5° resolution.

    Grid cells with fewer than three soil C estimates or fewer than four RH estimates were assigned NA values. Land and water grid cells were automatically distinguished by combining multiple datasets containing soil C and RH information over land.

    Soil carbon turnover time (years), denoted as τ, was calculated under the assumption of a quasi-equilibrium state using the formula:

    τ = CS / RH

    where CS is soil carbon stock and RH is the heterotrophic respiration rate. The uncertainty range of τ was estimated for each grid cell using:

    τmax = CS⁺ / RH⁻  τmin = CS⁻ / RH⁺

    where CS⁺ and CS⁻ are the maximum and minimum soil C values, and RH⁺ and RH⁻ are the maximum and minimum RH values, respectively.

    To calculate the temperature sensitivity of decomposition (Q10)—the factor by which decomposition rates increase with a 10 °C rise in temperature—we followed the method described in Koven et al. (2017). The uncertainty of Q10 (maximum and minimum values) was derived using τmax and τmin, respectively.

    All files are provided in NetCDF format. The SOC file includes the following variables:
    · longitude, latitude
    · soc: mean soil C stock (kg C m⁻²)
    · soc_median: median soil C (kg C m⁻²)
    · soc_n: number of estimates per grid cell
    · soc_max, soc_min: maximum and minimum soil C (kg C m⁻²)
    · soc_max_id, soc_min_id: study IDs corresponding to the maximum and minimum values
    · soc_range: range of soil C values
    · soc_sd: standard deviation of soil C (kg C m⁻²)
    · soc_cv: coefficient of variation (%)
    The RH file includes:
    · longitude, latitude
    · rh: mean RH (g C m⁻² yr⁻¹)
    · rh_median, rh_n, rh_max, rh_min: as above
    · rh_max_id, rh_min_id: study IDs for max/min
    · rh_range, rh_sd, rh_cv: analogous variables for RH
    The mean, maximum, and minimum values of soil C turnover time are provided as separate files. The Q10 files contain estimates derived from the mean values of soil C and RH, along with associated uncertainty values.

    The harmonized dataset files available in the repository are as follows:

    · harmonized-RH-hdg.nc: global soil heterotrophic respiration map

    · harmonized-SOC100-hdg.nc: global soil C map for 0–100 cm

    · harmonized-SOC30-hdg.nc: global soil C map for 0–30 cm

    · Q10.nc: global Q10 map

    · Turnover-time_max.nc: global soil C turnover time estimated using maximum soil C and minimum RH

    · Turnover-time_min.nc: global soil C turnover time estimated using minimum soil C and maximum RH

    · Turnover-time_mean.nc: global soil C turnover time estimated using mean soil C and RH

    · Turnover-time30_mean.nc: global soil C turnover time estimated using the soil C map for 0-30 cm
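
    As a minimal xarray sketch, mean turnover time can be recomputed from the files named above via τ = CS / RH. The variable names and units follow the description (soc in kg C m⁻², rh in g C m⁻² yr⁻¹), so a factor of 1000 converts units; treat this as an illustrative assumption rather than the authors' exact procedure.

        import xarray as xr

        soc = xr.open_dataset("harmonized-SOC100-hdg.nc")["soc"]  # kg C m-2, 0-100 cm
        rh = xr.open_dataset("harmonized-RH-hdg.nc")["rh"]        # g C m-2 yr-1

        # Turnover time in years, assuming quasi-equilibrium (tau = CS / RH).
        tau_years = (soc * 1000.0) / rh
        print(float(tau_years.mean(skipna=True)))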

    Version history
    Version 1.1: Median values were added. Bug fix for SOC30 (n>2 was inactive in the former version)


    More details are provided in: Hashimoto, S., Ito, A. & Nishina, K. (in revision). Harmonized global soil carbon and respiration datasets with derived turnover time and temperature sensitivity. Scientific Data.

    Reference

    Koven, C. D., Hugelius, G., Lawrence, D. M. & Wieder, W. R. Higher climatological temperature sensitivity of soil carbon in cold than warm climates. Nat. Clim. Change 7, 817–822 (2017).

    Table 1: List of soil carbon and heterotrophic respiration datasets used in this study.

    Dataset | Repository/References (Dataset name) | Depth | ID in NetCDF file***
    Global soil C | Global soil data task 2000 (IGBP-DIS) [1] | 0–100 | 3, -
    Global soil C | Shangguan et al. 2014 (GSDE) [2,3] | 0–100, 0–30* | 1, 1
    Global soil C | Batjes 2016 (WISE30sec) [4,5] | 0–100, 0–30 | 6, 7
    Global soil C | Sanderman et al. 2017 (Soil-Carbon-Debt) [6,7] | 0–100, 0–30 | 5, 5
    Global soil C | Soilgrids team and Hengl et al. 2017 (SoilGrids) [8,9] | 0–30** | -, 6
    Global soil C | Hengl and Wheeler 2018 (LandGIS) [10] | 0–100, 0–30 | 4, 4
    Global soil C | FAO 2022 (GSOC) [11] | 0–30 | -, 2
    Global soil C | FAO 2023 (HWSD2) [12] | 0–100, 0–30 | 2, 3
    Circumpolar soil C | Hugelius et al. 2013 (NCSCD) [13-15] | 0–100, 0–30 | 7, 8
    Global RH | Hashimoto et al. 2015 [16,17] | - | 1
    Global RH | Warner et al. 2019 (Bond-Lamberty equation based) [18,19] | - | 2
    Global RH | Warner et al. 2019 (Subke equation based) [18,19] | - | 3
    Global RH | Tang et al. 2020 [20,21] | - | 4
    Global RH | Lu et al. 2021 [22,23] | - | 5
    Global RH | Stell et al. 2021 [24,25] | - |

  11. YouTube Comments Data

    • kaggle.com
    zip
    Updated Jun 2, 2025
    Cite
    Manjit Baishya (2025). YouTube Comments Data [Dataset]. https://www.kaggle.com/datasets/manjitbaishya001/youtube-comments-data
    Explore at:
    Available download formats: zip (2927759 bytes)
    Dataset updated
    Jun 2, 2025
    Authors
    Manjit Baishya
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    YouTube
    Description

    🧠 YouTube Comments Dataset: Doctor Mike vs 20 Anti-Vaxxers | Jubilee

    This dataset contains raw and labeled YouTube comments from the Jubilee video titled "Doctor Mike vs 20 Anti-Vaxxers | Surrounded" up to June 2, 2025. The comments reflect a wide range of opinions, sentiments, and arguments related to vaccination, medical science, and public discourse. This makes the dataset particularly valuable for Natural Language Processing (NLP) tasks in real-world, socially charged contexts.

    📌 Dataset Overview

    • Video Title: Doctor Mike vs 20 Anti-Vaxxers | Surrounded
    • Published by: Jubilee
    • Platform: YouTube
    • Collected up to: June 2, 2025
    • Language: English
    • Format: CSV or JSON (depending on upload)
    • Licensing: Public comments on a public platform (refer to YouTube Terms of Service for downstream usage)

    🧪 Key Use Cases

    This dataset is ideal for a wide range of NLP tasks:

    • 🧠 Sentiment Analysis: Classify user opinions into positive, negative, neutral, or irrelevant.
    • 🎯 Toxic Comment Classification: Detect hate speech, misinformation, and emotionally charged content.
    • 🧵 Argument Mining: Identify claims, premises, and conclusions in discussions.
    • 🗣️ Opinion Summarization: Summarize mass opinions from large-scale discourse.
    • 📊 Trend Analysis: Analyze shifts in public opinion regarding vaccines and healthcare narratives.
    • 🔍 Stance Detection: Determine the pro/anti stance of a comment regarding vaccination.
    • 🌐 Multi-label Classification: Assign multiple categories to a comment based on topic, tone, or belief.

    📁 Dataset Columns

    Column Name | Description
    text | Raw comment text

    💡 Why This Dataset?

    This dataset offers a real-world sample of social media discourse on a controversial and medically relevant topic. It includes both supportive and oppositional viewpoints and can help train robust, bias-aware NLP models.

    Because the video includes professional input from Doctor Mike and diverse opinions from 20 participants with strong anti-vaccine views, the comment section becomes a rich playground for studying digital rhetoric, misinformation, and science communication.

    🧰 Suggested Tasks

    • Binary or multi-class sentiment classification
    • Toxicity and hate speech detection
    • Conversational analysis
    • Keyword or entity extraction
    • Fine-tuning transformer models (e.g., BERT, RoBERTa)
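
    As a minimal baseline for the sentiment-classification task listed above, the following sketch trains a TF-IDF + logistic regression model on the "text" column. The file name and the label column ("sentiment") are assumptions; the listing only documents the raw text column, so labels may be provided separately in the labeled release.

        import pandas as pd
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import make_pipeline

        df = pd.read_csv("youtube_comments.csv").dropna(subset=["text"])  # hypothetical file name

        X_train, X_test, y_train, y_test = train_test_split(
            df["text"], df["sentiment"], test_size=0.2, random_state=42)  # "sentiment" is assumed

        model = make_pipeline(TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
                              LogisticRegression(max_iter=1000))
        model.fit(X_train, y_train)
        print("held-out accuracy:", model.score(X_test, y_test))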

    📎 A Note on Ethics

    Please use this dataset responsibly. Public comments may include misinformation or strong personal views. Consider including disclaimers or filters when using this data for deployment or educational use. Always be mindful of bias, representation, and the propagation of harmful narratives.

    🔗 Source Acknowledgment

    All comments are sourced from the publicly available YouTube comment section of Jubilee’s video. We are not affiliated with Jubilee or Doctor Mike.

  12. Data for A method for assessment of the general circulation model quality...

    • data.niaid.nih.gov
    • data.taltech.ee
    Updated Mar 10, 2021
    + more versions
    Cite
    Maljutenko, Ilja; Raudsepp, Urmas (2021). Data for A method for assessment of the general circulation model quality using k-means clustering algorithm [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4588509
    Explore at:
    Dataset updated
    Mar 10, 2021
    Dataset provided by
    Department of Marine Systems at Tallinn University of Technology
    Authors
    Maljutenko, Ilja; Raudsepp, Urmas
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset consists of simulated and observed salinity/temperature data which were used in the manuscript "A method for assessment of the general circulation model quality using k-means clustering algorithm" submitted to Geoscientific Model Development. The model simulation dataset is from long-term 3D circulation model simulation (Maljutenko and Raudsepp 2014, 2019). The observations are from the "Baltic Sea - Eutrophication and Acidity aggregated datasets 1902/2017 v2018" SMHI (2018).

    The files are in simple comma separated table format without headers. The Dout-t_z_lat_lon_Smod_Sobs_Tmod_Tobs.csv file contains columns with the following variables [units]: time [Matlab datenum units], vertical coordinate [m], latitude [°N], longitude [°E], model salinity [g/kg], observed salinity [g/kg], model temperature [°C], observed temperature [°C].

    The Dout-t_z_lat_lon_dS_dT_K1_K2_K3_K4_K5_K6_K7_K8_K9.csv file contains columns with the following variables [units]: the first four columns are the same as in the previous file, followed by salinity error [g/kg], temperature error [°C], and integer columns (K1–K9) showing the cluster to which the error pair is designated.

    do_clust_valid_DataFig.m is a Matlab script which reads the two csv files (and optionally the mask file Model_mask.mat), performs the clustering analysis, and creates the plots used in the manuscript. The script is organized into %% blocks which can be executed separately (default: ctrl+enter).

    The k-means function is used from the Matlab Statistics and Machine Learning Toolbox.
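
    For users without Matlab, a minimal Python sketch of the same clustering step with scikit-learn is shown below. The column positions follow the file description above (columns 5-6 hold the salinity and temperature errors), and the choice of nine clusters mirrors the K1...K9 columns; it is an assumption for illustration, not the authors' setting.

        import pandas as pd
        from sklearn.cluster import KMeans

        # Columns (no header): time, depth, lat, lon, dS, dT, K1...K9; cluster the
        # salinity/temperature error pairs (columns 5-6, i.e. zero-based 4:6).
        data = pd.read_csv("Dout-t_z_lat_lon_dS_dT_K1_K2_K3_K4_K5_K6_K7_K8_K9.csv", header=None)
        errors = data.iloc[:, 4:6].to_numpy()

        labels = KMeans(n_clusters=9, n_init=10, random_state=0).fit_predict(errors)
        print(pd.Series(labels).value_counts())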

    Additional software used in the do_clust_valid_DataFig.m:

    Author's auxiliary formatting scripts (in script/): datetick_cst.m, do_fitfig.m, do_skipticks.m, do_skipticks_y.m

    Colormaps are generated using cbrewer.m (Charles, 2021). Moving average smoothing is performed using nanmoving_average.m (Aguilera, 2021).

    References:

    Aguilera, C. A. V., 2021. moving_average v3.1 (Mar 2008) (https://www.mathworks.com/matlabcentral/fileexchange/12276-moving_average-v3-1-mar-2008), MATLAB Central File Exchange. Retrieved March 2, 2021.

    Charles, 2021. cbrewer : colorbrewer schemes for Matlab (https://www.mathworks.com/matlabcentral/fileexchange/34087-cbrewer-colorbrewer-schemes-for-matlab), MATLAB Central File Exchange. Retrieved March 2, 2021.

    Maljutenko, I., Raudsepp, U., 2019. Long-term mean, interannual and seasonal circulation in the Gulf of Finland—the wide salt wedge estuary or gulf type ROFI. Journal of Marine Systems, 195, pp.1-19. doi:10.1016/j.jmarsys.2019.03.004

    Maljutenko, I., Raudsepp, U., 2014. Validation of GETM model simulated long-term salinity fields in the pathway of saltwater transport in response to the Major Baltic Inflows in the Baltic Sea. Measuring and Modeling of Multi-Scale Interactions in the Marine Environment - IEEE/OES Baltic International Symposium 2014, BALTIC 2014, 6887830. doi:10.1109/BALTIC.2014.6887830

    SMHI 2018, Swedish Meteorological and Hydrological Institute (SMHI) (2018). Baltic Sea - Eutrophication and Acidity aggregated datasets 1902/2017 v2018. Aggregated datasets were generated in the framework of EMODnet Chemistry III, under the support of DG MARE Call for Tender EASME/EMFF/2016/006 - lot4. doi:10.6092/595D233C-3F8C-4497-8BD2-52725CEFF96B

  13. PSL Complete Dataset (2016-2025)

    • kaggle.com
    zip
    Updated Jul 3, 2025
    Cite
    Zeeshan Ahmad (2025). PSL Complete Dataset (2016-2025) [Dataset]. https://www.kaggle.com/datasets/zeeshanahmad124586/psl-complete-dataset-2016-2025
    Explore at:
    zip (768083 bytes)
    Dataset updated
    Jul 3, 2025
    Authors
    Zeeshan Ahmad
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📘 Context

    The Pakistan Super League (PSL) is a premier Twenty20 cricket league in Asia, established in 2015 by the Pakistan Cricket Board (PCB). It features six franchise teams representing major cities in Pakistan, competing in a round-robin format followed by playoffs and a grand final. Known for its high-octane action, world-class players, and massive fan following, PSL has become one of the most exciting T20 leagues globally.

    This dataset captures the complete history of PSL matches from its inception through March 2025, making it a valuable resource for cricket analysts, machine learning practitioners, sports journalists, and fans who want to dive deep into player and team performance trends.

    📦 Content

    • Geography: Pakistan, UAE (Asia)
    • Time Period: February 4, 2016 – March 18, 2025
    • Unit of Analysis: Ball-by-ball records of Pakistan Super League (PSL) matches

    📊 Variables

    The dataset includes ball-level data and match-level summaries, making it ideal for both high-level analytics and granular delivery-by-delivery insights.

    Column Name          Description
    id                   Unique identifier for each delivery
    match_id             Unique identifier for each match
    date                 Date of the match
    season               PSL season during which the match was played
    venue                Stadium where the match was held
    inning               Inning number (1 or 2)
    batting_team         Team currently batting
    bowling_team         Team currently bowling
    over                 Over number in the innings (0 to 19)
    ball                 Ball number within the over (1 to 6)
    batter               Name of the batsman facing the delivery
    bowler               Name of the bowler delivering the ball
    non_striker          Name of the non-striking batsman
    batsman_runs         Runs scored by the batter on that delivery
    extra_runs           Runs awarded as extras (wide, no-ball, etc.)
    total_runs           Total runs scored on the delivery (batsman + extras)
    extras_type          Type of extra run (e.g., wide, no-ball, bye)
    is_wicket            1 if a wicket fell on the delivery; 0 otherwise
    player_dismissed     Name of the player dismissed on the delivery (if any)
    dismissal_kind       Mode of dismissal (e.g., caught, bowled, run out)
    fielder              Name of the fielder involved in the dismissal (if applicable)
    winner               Team that won the match
    win_by               Margin of victory (e.g., "wickets 6", "runs 25")
    match_type           Stage of the match (e.g., league, eliminator, qualifier, final)
    player_of_match      Best-performing player of the match
    umpire_1             Name of the first on-field umpire
    umpire_2             Name of the second on-field umpire
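
    As a small usage illustration (the file name is an assumption; the column names are those listed above), the ball-by-ball rows can be aggregated into per-batter totals with pandas:

      # Hypothetical file name; columns follow the variable list above.
      import pandas as pd

      df = pd.read_csv("psl_ball_by_ball.csv")

      runs = df.groupby("batter")["batsman_runs"].sum()
      balls = df.groupby("batter").size()          # simplified: wides are not excluded here
      summary = (pd.DataFrame({"runs": runs, "balls": balls})
                   .assign(strike_rate=lambda d: 100 * d.runs / d.balls)
                   .sort_values("runs", ascending=False))
      print(summary.head(10))
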
  14. The ATLAS of Traffic Lights

    • zenodo.org
    mp4, zip
    Updated Feb 12, 2025
    Cite
    Rupert Polley; Nikolai Polley; Dominik Heid; Marc Heinrich; Sven Ochs; J. Marius Zöllner (2025). The ATLAS of Traffic Lights [Dataset]. http://doi.org/10.5281/zenodo.14775869
    Explore at:
    zip, mp4
    Dataset updated
    Feb 12, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rupert Polley; Nikolai Polley; Dominik Heid; Marc Heinrich; Sven Ochs; J. Marius Zöllner
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This is an older version: Please use the newest available version.

    Changelog:

    • 31. Jan. 2024: v0.1 - We released a small dataset sample. Until the full release on the 28. Feb. 2025, the annotation format may be subject to change.

    ATLAS


    ATLAS (Applied Traffic Light Annotation Set) is a new, publicly available dataset designed to improve traffic light detection for autonomous driving. Existing open-source datasets often omit certain traffic light states and lack camera configurations for near and far distances. To address this, ATLAS features over 33,000 images collected from three synchronized cameras (wide, medium, and tele) with varied fields of view in the German city of Karlsruhe. This setup captures traffic lights at diverse distances and angles, including difficult overhead views. Each of the dataset’s 72,998 bounding boxes is meticulously labeled with one of 25 unique pictogram-state classes, covering rare but critical states (e.g., red-yellow) and pictograms (straight-right, straight-left). Additional annotations include challenging conditions such as heavy rain. All data is anonymized using state-of-the-art tools. ATLAS provides a comprehensive, high-quality resource for robust traffic light detection, overcoming limitations of existing datasets.

    Camera         FOV [°]    Resolution     Images
    Front-Medium   61 × 39    1920 × 1200    25,158
    Front-Tele     31 × 20    1920 × 1200    5,109
    Front-Wide     106 × 92   2592 × 2048    2,777


    Directory Format:

    We provide the dataset in the following format:

    ATLAS
    ├── train
    │   ├── front_medium
    │   │   ├── images
    │   │   │   └── front_medium_1722622455-950002160.jpg
    │   │   └── labels
    │   │       └── front_medium_1722622455-950002160.txt
    │   ├── front_tele
    │   └── front_wide
    ├── test
    ├── ATLAS_classes.yaml
    ├── LICENSE
    └── README.md

    Annotation Format:

    Each line in an annotation file describes one bounding box using five fields (a short parsing sketch follows the list below):

    class_id x_center y_center width height

    1. class_id: An integer indicating the class of the annotated object. The file ATLAS_classes.yaml contains human-readable names corresponding to each numeric label.
    2. x_center, y_center: The normalized coordinates of the bounding box center, relative to the image dimensions (in the range [0,1]), where x_center is measured horizontally and y_center vertically.
    3. width, height: The normalized width and height of the bounding box, also expressed in the range [0,1]. These values are obtained by dividing the bounding box width and height in pixels by the overall image width and height, respectively.
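
    A minimal parsing sketch for this format (the label file name is taken from the directory listing above; the image size uses the Front-Medium resolution from the camera table; both are only examples):

      # Parse a YOLO-style ATLAS label file and convert normalized boxes to pixel corners.
      from pathlib import Path

      def load_boxes(label_path, img_w, img_h):
          boxes = []
          for line in Path(label_path).read_text().splitlines():
              class_id, xc, yc, w, h = line.split()
              xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
              x0 = (xc - w / 2) * img_w
              y0 = (yc - h / 2) * img_h
              x1 = (xc + w / 2) * img_w
              y1 = (yc + h / 2) * img_h
              boxes.append((int(class_id), x0, y0, x1, y1))
          return boxes

      boxes = load_boxes("front_medium_1722622455-950002160.txt", img_w=1920, img_h=1200)
      print(boxes[:3])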

    Terms and Conditions

    The ATLAS Dataset by FZI Research Center for Information Technology is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

    Therefore, the Dataset is only allowed to be used for non-commercial purposes, such as teaching and research. The Licensor thus grants the End User the right to use the dataset for its own internal and non-commercial use and the purpose of scientific research only. There may be inaccuracies, although the Licensor tried and will try its best to rectify any inaccuracy once found. We invite all users to report remarks via mail at polley@fzi.de

    If the dataset is used in media, a link to the Licensor’s website is to be included. In case the End User uses the dataset within research papers, the following publication should be quoted:

    Polley et al.: The ATLAS of Traffic Lights: A Reliable Perception Framework for Autonomous Driving (under review)

  15. Dataset of pdf files

    • kaggle.com
    Updated May 1, 2024
    Cite
    Manisha717 (2024). Dataset of pdf files [Dataset]. https://www.kaggle.com/datasets/manisha717/dataset-of-pdf-files
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 1, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Manisha717
    Description

    The dataset consists of diverse PDF files covering a wide range of topics. These files include reports, articles, manuals, and more, spanning various fields such as science, technology, history, literature, and business. With its broad content, the dataset offers versatility for testing and various purposes, making it valuable for researchers, developers, educators, and enthusiasts alike.

  16. t

    Data from: REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic...

    • researchdata.tuwien.ac.at
    txt, zip
    Updated Jul 15, 2025
    Cite
    Daniel Jan Sliwowski; Shail Jadav; Sergej Stanovcic; Jędrzej Orbik; Johannes Heidersberger; Dongheui Lee (2025). REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic Assembly and Disassembly [Dataset]. http://doi.org/10.48436/0ewrv-8cb44
    Explore at:
    zip, txt
    Dataset updated
    Jul 15, 2025
    Dataset provided by
    TU Wien
    Authors
    Daniel Jan Sliwowski; Shail Jadav; Sergej Stanovcic; Jędrzej Orbik; Johannes Heidersberger; Dongheui Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 9, 2025 - Jan 14, 2025
    Description

    REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic Assembly and Disassembly

    📋 Introduction

    Robotic manipulation remains a core challenge in robotics, particularly for contact-rich tasks such as industrial assembly and disassembly. Existing datasets have significantly advanced learning in manipulation but are primarily focused on simpler tasks like object rearrangement, falling short of capturing the complexity and physical dynamics involved in assembly and disassembly. To bridge this gap, we present REASSEMBLE (Robotic assEmbly disASSEMBLy datasEt), a new dataset designed specifically for contact-rich manipulation tasks. Built around the NIST Assembly Task Board 1 benchmark, REASSEMBLE includes four actions (pick, insert, remove, and place) involving 17 objects. The dataset contains 4,551 demonstrations, of which 4,035 were successful, spanning a total of 781 minutes. Our dataset features multi-modal sensor data including event cameras, force-torque sensors, microphones, and multi-view RGB cameras. This diverse dataset supports research in areas such as learning contact-rich manipulation, task condition identification, action segmentation, and more. We believe REASSEMBLE will be a valuable resource for advancing robotic manipulation in complex, real-world scenarios.

    ✨ Key Features

    • Multimodality: REASSEMBLE contains data from robot proprioception, RGB cameras, force-torque sensors, microphones, and event cameras.
    • Multitask labels: REASSEMBLE contains labels that enable research in Temporal Action Segmentation, Motion Policy Learning, Anomaly Detection, and Task Inversion.
    • Long horizon: Demonstrations in the REASSEMBLE dataset cover long-horizon tasks and actions which usually span multiple steps.
    • Hierarchical labels: REASSEMBLE contains action segmentation labels at two hierarchical levels.

    🔴 Dataset Collection

    Each demonstration starts by randomizing the board and object poses, after which an operator teleoperates the robot to assemble and disassemble the board while narrating their actions and marking task segment boundaries with key presses. The narrated descriptions are transcribed using Whisper [1], and the board and camera poses are measured at the beginning using a motion capture system, though continuous tracking is avoided due to interference with the event camera. Sensory data is recorded with rosbag and later post-processed into HDF5 files without downsampling or synchronization, preserving raw data and timestamps for future flexibility. To reduce memory usage, video and audio are stored as encoded MP4 and MP3 files, respectively. Transcription errors are corrected automatically or manually, and a custom visualization tool is used to validate the synchronization and correctness of all data and annotations. Missing or incorrect entries are identified and corrected, ensuring the dataset’s completeness. Low-level Skill annotations were added manually after data collection, and all labels were carefully reviewed to ensure accuracy.

    📑 Dataset Structure

    The dataset consists of several HDF5 (.h5) and JSON (.json) files, organized into two directories. The poses directory contains the JSON files, which store the poses of the cameras and the board in the world coordinate frame. The data directory contains the HDF5 files, which store the sensory readings and annotations collected as part of the REASSEMBLE dataset. Each JSON file can be matched with its corresponding HDF5 file based on their filenames, which include the timestamp when the data was recorded. For example, 2025-01-09-13-59-54_poses.json corresponds to 2025-01-09-13-59-54.h5.

    The structure of the JSON files is as follows:

    {"Hama1": [
        [x ,y, z],
        [qx, qy, qz, qw]
     ], 
     "Hama2": [
        [x ,y, z],
        [qx, qy, qz, qw]
     ], 
     "DAVIS346": [
        [x ,y, z],
        [qx, qy, qz, qw]
     ], 
     "NIST_Board1": [
        [x ,y, z],
        [qx, qy, qz, qw]
     ]
    }

    [x, y, z] represent the position of the object, and [qx, qy, qz, qw] represent its orientation as a quaternion.
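
    Reading such a pose file is straightforward; a minimal sketch (the file name is the example given above):

      import json

      with open("2025-01-09-13-59-54_poses.json") as f:
          poses = json.load(f)

      for name, (position, quaternion) in poses.items():
          print(f"{name}: position={position}, orientation (xyzw)={quaternion}")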

    The HDF5 (.h5) format organizes data into two main types of structures: datasets, which hold the actual data, and groups, which act like folders that can contain datasets or other groups. In the diagram below, groups are shown as folder icons, and datasets as file icons. The main group of the file directly contains the video, audio, and event data. To save memory, video and audio are stored as encoded byte strings, while event data is stored as arrays. The robot’s proprioceptive information is kept in the robot_state group as arrays. Because different sensors record data at different rates, the arrays vary in length (signified by the N_xxx variable in the data shapes). To align the sensory data, each sensor’s timestamps are stored separately in the timestamps group. Information about action segments is stored in the segments_info group. Each segment is saved as a subgroup, named according to its order in the demonstration, and includes a start timestamp, end timestamp, a success indicator, and a natural language description of the action. Within each segment, low-level skills are organized under a low_level subgroup, following the same structure as the high-level annotations.

    [Diagram omitted: HDF5 file layout with the groups and datasets described above.]
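
    A minimal exploration sketch with h5py (group names follow the description above; the exact names of the datasets inside each group may differ, so the file is simply walked):

      import h5py

      with h5py.File("2025-01-09-13-59-54.h5", "r") as f:
          # Print the overall layout: video/audio/event data, robot_state, timestamps, segments_info, ...
          f.visit(print)

          # Iterate over the annotated segments described above; start/end timestamps, the success
          # indicator, and the language description are stored per segment (as datasets or attributes).
          for name, segment in f["segments_info"].items():
              print(name, list(segment.keys()), dict(segment.attrs))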

    The splits folder contains two text files listing the .h5 files used for the training and validation splits.

    📌 Important Resources

    The project website contains more details about the REASSEMBLE dataset. The code for loading and visualizing the data is available in our GitHub repository.

    📄 Project website: https://tuwien-asl.github.io/REASSEMBLE_page/
    💻 Code: https://github.com/TUWIEN-ASL/REASSEMBLE

    ⚠️ File comments

    Below is a table listing the recordings with known issues. Issues typically correspond to missing data from one of the sensors.

    Recording                 Issue
    2025-01-10-15-28-50.h5    hand cam missing at beginning
    2025-01-10-16-17-40.h5    missing hand cam
    2025-01-10-17-10-38.h5    hand cam missing at beginning
    2025-01-10-17-54-09.h5    no empty action at

  17. UWMGI Image Segmentation TFRecords

    • kaggle.com
    zip
    Updated Jun 17, 2022
    Cite
    tt195361 (2022). UWMGI Image Segmentation TFRecords [Dataset]. https://www.kaggle.com/tt195361/uwmgi-image-segmentation-tfrecords
    Explore at:
    zip (1590547976 bytes)
    Dataset updated
    Jun 17, 2022
    Authors
    tt195361
    Description

    This dataset is a collection of TFRecord files to train the models for the UW-Madison GI Tract Image Segmentation competition, specifically on TPU. Each TFRecord file contains the following data:

    Name                     Type       Description
    id                       bytes      sample ID taken from the 'id' column in 'train.csv', utf-8 encoded
    case number              int64      case number taken from 'id' at caseNNN
    day number               int64      day number taken from 'id' at dayNN
    slice number             int64      slice number taken from 'id' at slice_NNNN
    image                    bytes      numpy format image bytes read from the associated file
    mask                     bytes      PNG format mask bytes generated from the 'segmentation' column in 'train.csv'
    fold                     int64      fold number that this sample belongs to
    height                   int64      slice height taken from the file name
    width                    int64      slice width taken from the file name
    space height             float32    pixel spacing height taken from the file name
    space width              float32    pixel spacing width taken from the file name
    large bowel dice coef    float32    how well the model predicted for large bowel
    small bowel dice coef    float32    how well the model predicted for small bowel
    stomach dice coef        float32    how well the model predicted for stomach
    slice count              int64      number of slices for case/day

    A sample format definition to read the record is as follows.

      # Requires TensorFlow; this feature spec is used with tf.io.parse_single_example.
      import tensorflow as tf

      TFREC_FORMAT = {
        'id': tf.io.FixedLenFeature([], tf.string),
        'case_no': tf.io.FixedLenFeature([], tf.int64),
        'day_no': tf.io.FixedLenFeature([], tf.int64),
        'slice_no': tf.io.FixedLenFeature([], tf.int64),
        'image': tf.io.FixedLenFeature([], tf.string),
        'mask': tf.io.FixedLenFeature([], tf.string),
        'fold': tf.io.FixedLenFeature([], tf.int64),
        'height': tf.io.FixedLenFeature([], tf.int64),
        'width': tf.io.FixedLenFeature([], tf.int64),
        'space_h': tf.io.FixedLenFeature([], tf.float32),
        'space_w': tf.io.FixedLenFeature([], tf.float32),
        'large_bowel_dice_coef': tf.io.FixedLenFeature([], tf.float32),
        'small_bowel_dice_coef': tf.io.FixedLenFeature([], tf.float32),
        'stomach_dice_coef': tf.io.FixedLenFeature([], tf.float32),
        'slice_count': tf.io.FixedLenFeature([], tf.int64),
      }
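
    One possible way to consume records with the format above (the file path pattern is an assumption; the mask is PNG-encoded per the table, while decoding of the raw numpy image bytes depends on how the creation notebook serialized them, so it is left out here):

      def parse_example(serialized):
          ex = tf.io.parse_single_example(serialized, TFREC_FORMAT)
          mask = tf.io.decode_png(ex['mask'], channels=3)    # PNG-encoded mask
          # ex['image'] holds raw numpy bytes; decode according to how it was written.
          return ex['image'], mask, ex['fold']

      files = tf.io.gfile.glob("uwmgi-tfrecords/*.tfrec")    # hypothetical path pattern
      ds = (tf.data.TFRecordDataset(files)
              .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
              .batch(16))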
    

    Here is the notebook used to make this dataset. Here is the notebook to train a model by using this dataset.

  18. World Population Data

    • kaggle.com
    zip
    Updated Jan 1, 2024
    + more versions
    Cite
    Sazidul Islam (2024). World Population Data [Dataset]. https://www.kaggle.com/datasets/sazidthe1/world-population-data/discussion
    Explore at:
    zip (14672 bytes)
    Dataset updated
    Jan 1, 2024
    Authors
    Sazidul Islam
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    World
    Description

    Context

    The world's population has undergone remarkable growth, exceeding 7.5 billion by mid-2019 and continuing to surge beyond previous estimates. Notably, China and India stand as the two most populous countries, with China's population potentially facing a decline while India's trajectory hints at surpassing it by 2030. This significant demographic shift is just one facet of a global landscape where countries like the United States, Indonesia, Brazil, Nigeria, and others, each with populations surpassing 100 million, play pivotal roles.

    The steady decrease in growth rates, though, is reshaping projections. While the world's population is expected to exceed 8 billion by 2030, growth will notably decelerate compared to previous decades. Specific countries like India, Nigeria, and several African nations will notably contribute to this growth, potentially doubling their populations before rates plateau.

    Content

    This dataset provides comprehensive historical population data for countries and territories globally, offering insights into various parameters such as area size, continent, population growth rates, rankings, and world population percentages. Spanning from 1970 to 2023, it includes population figures for different years, enabling a detailed examination of demographic trends and changes over time.

    Dataset

    Structured with meticulous detail, this dataset offers a wide array of information in a format conducive to analysis and exploration. Featuring parameters like population by year, country rankings, geographical details, and growth rates, it serves as a valuable resource for researchers, policymakers, and analysts. Additionally, the inclusion of growth rates and world population percentages provides a nuanced understanding of how countries contribute to global demographic shifts.

    This dataset is invaluable for those interested in understanding historical population trends, predicting future demographic patterns, and conducting in-depth analyses to inform policies across various sectors such as economics, urban planning, public health, and more.

    Structure

    This dataset (world_population_data.csv), covering 1970 through 2023, includes the following columns:

    Column Name                    Description
    Rank                           Rank by population
    CCA3                           3-digit country/territory code
    Country                        Name of the country
    Continent                      Name of the continent
    2023 Population                Population of the country in the year 2023
    2022 Population                Population of the country in the year 2022
    2020 Population                Population of the country in the year 2020
    2015 Population                Population of the country in the year 2015
    2010 Population                Population of the country in the year 2010
    2000 Population                Population of the country in the year 2000
    1990 Population                Population of the country in the year 1990
    1980 Population                Population of the country in the year 1980
    1970 Population                Population of the country in the year 1970
    Area (km²)                     Area of the country/territory in square kilometers
    Density (km²)                  Population density per square kilometer
    Growth Rate                    Population growth rate by country
    World Population Percentage    Percentage of the world population held by each country
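
    As a small usage illustration (the file name comes from the structure section above; column names are those listed in the table), each continent's share of the 2023 total can be computed with pandas:

      import pandas as pd

      df = pd.read_csv("world_population_data.csv")
      by_continent = df.groupby("Continent")["2023 Population"].sum().sort_values(ascending=False)
      print((100 * by_continent / by_continent.sum()).round(1))   # percentage of the 2023 world total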

    Acknowledgment

    The primary dataset was retrieved from the World Population Review. I sincerely thank the team for providing the core data used in this dataset.

    © Image credit: Freepik

  19. Landsat to Sentinel-2 (LS2S2), a dataset for the fusion of joint Landsat and...

    • zenodo.org
    pdf, txt, zip
    Updated Oct 29, 2025
    Cite
    Julien MICHEL (2025). Landsat to Sentinel-2 (LS2S2), a dataset for the fusion of joint Landsat and Sentinel-2 Satellite Image Time Series [Dataset]. http://doi.org/10.5281/zenodo.15471890
    Explore at:
    zip, txt, pdf
    Dataset updated
    Oct 29, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julien MICHEL
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description

    This dataset comprises joint Sentinel-2 and Landsat-8 and Landsat-9 Satellite Image Time Series (SITS) over the year 2022. It is composed of 64 Areas Of Interest (AOIs) for the training set and 41 AOIs for the testing set, each covering an area of 9.9x9.9 km² (990x990 pixels for Sentinel-2 and 330x330 pixels for Landsat). The AOIs cover Europe as well as a few spots in West Africa and the north of South America. All dates with more than 25% of clear pixels over the AOI are included in the dataset. This yields a total of 2,984 Sentinel-2 images and 1,609 Landsat-8 and -9 images in the training set. Additional statistics about the number of dates per AOI for each sensor are presented in the following table:

                    Sentinel-2                         Landsat
                    Total   Average   Min   Max        Total   Average   Min   Max
    Train split     2984    44        11    94         1581    26        0     61
    Test split      1609    39        6     98         736     18        1     56

    Important Notice: A multi-year, worldwide complement to this dataset, called LS2S2MYWW, is available here: https://doi.org/10.6096/1029 (it is too large to be hosted on Zenodo).

    For each sensor, Top-of-Canopy surface reflectance from level 2 products is used. The spectral bands included in the dataset are presented in the following table. It can be observed that the Landsat sensor does not have Red Edge bands or a wide Near Infra-Red band, and conversely the Sentinel-2 sensor does not retrieve Land Surface Temperature (LST).

    Sentinel-2                        Landsat                     Description
    Band            Resolution (m)    Band      Resolution (m)
    -               -                 B1        30                Deep blue
    B02             10                B2        30                Blue
    B03             10                B3        30                Green
    B04             10                B4        30                Red
    B05, B06, B07   20                -         -                 Red Edge
    B08             20                B5        30                Near Infra-Red
    B8a             10                -         -                 Wide Near Infra-Red
    B11, B12        20                B6, B7    30                Short Wavelength Infra-Red
    -               -                 B10       100               Land Surface Temperature

    In addition to the spectral bands, the corresponding quality masks have been used to derive a validity mask for each date of each sensor. This dataset has been gathered through the OpenEO API, in the frame of the following work:

    Julien Michel, Jordi Inglada. Temporal Attention Multi-Resolution Fusion of Satellite Image Time-Series, applied to Landsat-8 and Sentinel-2: all bands, any time, at best spatial resolution. 2025. ⟨hal-05101526⟩


    The source code associated with the paper, including the download script that created the dataset, is available here: https://github.com/Evoland-Land-Monitoring-Evolution/tamrfsits

    File organization

    Main zip files

    Two main zip files are provided: ls2s2_train.zip contains the training split, and ls2s2_test.zip contains the test split. Both zip files contain one internal zip file per AOI, organized as follows.

    Note that we provide test_31TCJ_12.zip as a sample for previewing the content of the dataset before downloading the train or test split.

    The dataset comprises one zip file per AOI. The naming pattern for the zip file is as follows: {test/train}_{mgrs_tile}_{subtile}.zip. The {test/train} field indicates whether the file is part of the training or testing set. The {mgrs_tile} field corresponds to the MGRS tile from which the AOI has been sampled. The {subtile} field indicates which sub-tile of the MGRS tile has been sampled. Sub-tiles correspond to the 1024x1024 internal JPEG2000 tiles of the Sentinel-2 product. Their numbering follows lexicographical order (columns then rows).

    Each zip file contains the following layout:

    {train/test}/{mgrs_tile}_{subtile}/
        {mgrs_tile}_{subtile}.json
        {mgrs_tile}_{subtile}_sentinel2_synopsis.png
        {mgrs_tile}_{subtile}_landsat_synopsis.png
        sentinel2/
            index.csv
            2022mmdd/
                sentinel2_mask_2022mmdd.tif
                sentinel2_bands_2022mmdd.tif
            ...
        landsat/
            index.csv
            index_pan.csv
            2022mmdd/
                landsat_mask_2022mmdd.tif
                landsat_bands_2022mmdd.tif
                landsat_pan_mask_2022mmdd.tif
                landsat_pan_2022mmdd.tif
            ...

    Files description

    Here is a description of the different files:


    • {mgrs_tile}_{subtile}.json: A JSON file describing the AOI.
    • {mgrs_tile}_{subtile}_sentinel2_synopsis.png: A synopsis PNG file allowing to see all Sentinel-2 images and masks of the AOI at a glance.
    • {mgrs_tile}_{subtile}_landsat_synopsis.png: A synopsis PNG file allowing to see all Landsat images and masks of the AOI at a glance.
    • index.csv: The csv file indexing the Sentinel-2 or Landsat data for the AOI.
    • index_pan.csv: The csv file indexing the Landsat panchromatic data for the AOI.
    • sentinel2_mask_2022mmdd.tif: 990x990 pixel GeoTIFF file containing the validity mask for the current date (0 for valid and 1 for invalid).
    • sentinel2_bands_2022mmdd.tif: 990x990 pixel GeoTIFF file containing the Sentinel-2 spectral bands in surface reflectance * 10 000. Band order is B2, B3, B4, B5, B6, B7, B8, B8A, B11, B12. 20m bands are up-sampled to 10m resolution by means of bicubic interpolation. No-data pixels have the value -10 000.
    • landsat_mask_2022mmdd.tif: 330x330 pixel GeoTIFF file containing the validity mask for the current date (0 for valid and 1 for invalid). Spatial resolution is 30m.
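
    An illustrative way to read one Sentinel-2 date (rasterio is an assumption, not part of the dataset, and the date in the paths is hypothetical; file names follow the layout above) is to scale the reflectances by 1/10 000 and blank out invalid pixels using the validity mask:

      import numpy as np
      import rasterio

      # Hypothetical date folder 20220614; adjust to an existing acquisition.
      with rasterio.open("sentinel2/20220614/sentinel2_bands_20220614.tif") as src:
          bands = src.read().astype(np.float32) / 10000.0    # 10 bands, 990x990, B2..B12
      with rasterio.open("sentinel2/20220614/sentinel2_mask_20220614.tif") as src:
          invalid = src.read(1) == 1                         # 0 = valid, 1 = invalid

      bands[:, invalid] = np.nan
      print(np.nanmean(bands, axis=(1, 2)))                  # mean reflectance per band
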

Cite
Loist, Skadi; Samoilova, Evgenia (Zhenya) (2024). Film Circulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7887671

Film Circulation dataset

Explore at:
Dataset updated
Jul 12, 2024
Dataset provided by
Film University Babelsberg KONRAD WOLF
Authors
Loist, Skadi; Samoilova, Evgenia (Zhenya)
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

Please cite this when using the dataset.

Detailed description of the dataset:

1 Film Dataset: Festival Programs

The Film Dataset consists a data scheme image file, a codebook and two dataset tables in csv format.

The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.

2 Survey Dataset

The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

3 IMDb & Scripts

The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”, where the cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.

The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

The script “r_5a_extracting_info_sample” uses the function defined in the “r_4_scraping_functions”, in order to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check, if everything works. Scraping for the entire dataset took a few hours. Therefore, a test with a subsample of 100 films is advisable.

The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tried to extract data one more time to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.

4 Festival Library Dataset

The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories, units of measurement, data sources and coding and missing data.

The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, all information for each festival is listed in one row. This
