CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The NLI-TR dataset, comprising two distinct datasets known as SNLI-TR and MNLI-TR, provides an unparalleled opportunity for research within the natural language processing (NLP) and machine learning communities. Its primary purpose is to facilitate natural language inference research in the Turkish language. The datasets consist of meticulously curated natural language inference data, which has been carefully translated into Turkish from original English sources. This resource enables researchers to develop automated models specifically tailored for making inferences on texts in this vibrant language. Furthermore, it offers valuable insights into cross-lingual generalisation capabilities, allowing investigation into how models trained on data from one language perform when applied to another. It supports tasks ranging from sentence paraphrasing and classification to question answering scenarios, featuring Turkish sentences labelled to indicate whether a premise and hypothesis entail, contradict, or are neutral towards each other.
The dataset records typically include the following columns:
The data is typically provided in CSV file format. It includes both training and validation sets to support model development and evaluation. Key files mentioned are SNLI_tr_train.csv for training models, snli_tr_validation for testing or validating model accuracy on unseen data, and multinli_tr_validation_{matched / mismatched}.csv for additional validation on complex scenarios. The multinli_tr_train.csv file contains Turkish sentences with their corresponding labels. The dataset is considered large-scale, with the multinli_tr_train.csv file, for instance, containing approximately 392,700 records.
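As a minimal illustration, the training split can be loaded and inspected with pandas; the column layout is not specified in the description, so nothing beyond the file name is assumed here:

import pandas as pd

# Load the Turkish SNLI training split and inspect its structure.
snli_tr_train = pd.read_csv('SNLI_tr_train.csv')
print(snli_tr_train.shape)
print(snli_tr_train.head())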
This dataset is ideal for various applications and use cases in NLP and machine learning:
The dataset's scope is primarily the Turkish language. Because the data has been translated from English sources, it is also useful for cross-lingual studies. A specific time range or demographic scope for the data collection is not detailed in the available sources.
CC0
The NLI-TR dataset is intended for a broad audience interested in natural language processing and machine learning, including:
Original Data Source: NLI-TR (Turkish NLI Research)
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description: Dive into the world of college placements with this dataset designed to unravel the factors influencing student placement outcomes. The dataset comprises crucial parameters such as IQ scores, CGPA (Cumulative Grade Point Average), and placement status. Aspiring data scientists, researchers, and enthusiasts can leverage this dataset to uncover patterns and insights that contribute to a deeper understanding of successful college placements.
Project Idea 1: Predictive Modeling for College Placements Utilize machine learning algorithms to build a predictive model that forecasts a student's likelihood of placement based on their IQ scores and CGPA. Evaluate and compare the effectiveness of different algorithms to enhance prediction accuracy.
Project Idea 2: Feature Importance Analysis Conduct a feature importance analysis to identify the key factors that significantly influence placement outcomes. Gain insights into whether IQ, CGPA, or a combination of both plays a more dominant role in determining success.
Project Idea 3: Clustering Analysis of Placement Trends Apply clustering techniques to group students based on their placement outcomes. Explore whether distinct clusters emerge, shedding light on common characteristics or trends among students who secure placements.
Project Idea 4: Correlation Analysis with External Factors Investigate the correlation between the provided data (IQ, CGPA, placement) and external factors such as internship experience, extracurricular activities, or industry demand. Assess how these external factors may complement or influence placement success.
Project Idea 5: Visualization of Placement Dynamics Over Time Create dynamic visualizations to illustrate how placement trends evolve over time. Analyze trends, patterns, and fluctuations in placement rates to identify potential cyclical or seasonal influences on student placements.
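As a minimal sketch of Project Idea 1, the snippet below fits a logistic regression on the two predictors. The file name is illustrative, and the column names follow the column descriptions given below.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative file name; columns follow the dataset description (IQ, CGPA, Placement).
df = pd.read_csv('college_placements.csv')
X = df[['IQ', 'CGPA']]
y = df['Placement']

# Hold out 20% of the records to estimate out-of-sample accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
print('Test accuracy:', accuracy_score(y_test, model.predict(X_test)))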
IQ: The student's measured IQ score.
CGPA: The student's Cumulative Grade Point Average.
Placement: The placement outcome, indicating whether the student secured a placement.
These columns collectively provide a comprehensive snapshot of a student's intellectual abilities, academic performance, and their success in securing a placement. Analyzing this dataset can offer valuable insights into the dynamics of college placements and inform strategies for optimizing student outcomes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises 153 subfolders within a primary directory named data, derived from 85 participants. Each participant typically contributes 2–3 subfolders, contingent on the completeness and quality of their M-mode echocardiography (UCG) recordings. Subfolder names follow the format hdata + SubjectID + EJ/XJ/ZJ to denote the specific cardiac region captured in the ultrasound data: EJ denotes M-mode imaging of the mitral valve, XJ denotes M-mode imaging of the left ventricle, and ZJ denotes M-mode imaging of the aortic valve. For instance, a participant with identifier “001” may have subfolders named hdata1EJ, hdata1XJ, and/or hdata1ZJ, corresponding to each available M-mode echocardiographic segment. Each subfolder contains five distinct files, described in detail below.

1. BCG J-peak file
(1) File name: hdata+subjectID+EJ/XJ/ZJ_BCG.csv
(2) Content: J-peak positions in the BCG signal, presented in two columns:
(3) The first column provides the raw data point index.
(4) The second column specifies the corresponding time (in seconds) for each J-peak.

2. ECG R-peak file
(1) File name: hdata+subjectID+EJ/XJ/ZJ_ECG.csv
(2) Content: R-peak positions in the ECG signal, also in two columns:
(3) The first column provides the raw data point index.
(4) The second column specifies the corresponding time (in seconds) for each R-peak.

3. Ultrasound video
(1) File name: hdata+subjectID+EJ/XJ/ZJ_UCG.AVI
(2) Content: An AVI-format video of the simultaneously acquired M-mode echocardiogram. The suffix EJ, XJ, or ZJ indicates whether the imaging targeted the mitral valve, left ventricle, or aortic valve, respectively.

4. Signal data
(1) File name: signal.csv
(2) Content: Three columns of time-series data sampled at 100 Hz: raw BCG signal (Column 1); ECG data (Lead V2 or another designated lead) (Column 2); denoised BCG signal (Column 3), derived using the Enhanced Singular Value Thresholding (ESVT) algorithm.

5. Signal visualization
(1) File name: signal.pdf
(2) Content: A graphical representation of the signals from signal.csv. This file facilitates quick inspection of waveform alignment and overall signal quality.

In addition to the data directory, an Additional_info folder provides participant demographic and clinical details. Each row in subject_info.csv corresponds to an individual participant, listing their ID, sex, weight, height, age, heart rate, and ejection fraction (EF) (%). These parameters establish an informative link between each participant’s anthropometric profile, cardiac function metrics, and the corresponding BCG, ECG, and ultrasound data.
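A minimal sketch of loading one subfolder's files with pandas is shown below; the subfolder name is illustrative, and the assumption that the CSV files carry no header row should be checked against the actual data.

import pandas as pd

# Illustrative paths following the naming scheme described above.
signal = pd.read_csv('data/hdata1EJ/signal.csv', header=None,
                     names=['bcg_raw', 'ecg', 'bcg_denoised'])
j_peaks = pd.read_csv('data/hdata1EJ/hdata1EJ_BCG.csv', header=None,
                      names=['sample_index', 'time_s'])

# The signals are sampled at 100 Hz, so the recording length in seconds is:
duration_s = len(signal) / 100.0
print(f'Recording length: {duration_s:.1f} s, {len(j_peaks)} J-peaks detected')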
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge, however they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.
This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousands of loads and several hundreds of generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.
The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.
The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.
The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
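For example, converting a table of per-unit values back to MW only requires multiplying by the 100 MW base; a minimal pandas sketch, using a file name that follows the labelling scheme described below:

import pandas as pd

# Load one year of load profiles; all values are in per-unit (base 100 MW).
loads_pu = pd.read_csv('loads_2020_1.csv')

# Convert to MW by multiplying by the base power.
loads_mw = loads_pu * 100.0
print(loads_mw.iloc[:5, :3])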
There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using loads, generators, and lines profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):
import pandas as pd
CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)
The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:
CH_gens_list = CH_gens.dropna().squeeze().to_list()
Finally, we can import all the time series of Swiss generators from a given data table with
pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)
The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.
This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:
hourly_loads = pd.read_csv('loads_2018_3.csv')
To get a daily average of the loads, we can use:
daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()
This results in series of length 364. To average further over entire weeks and get series of length 52, we use:
weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation in the form of Jupyter notebooks contains numerous examples on how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.
This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘San Francisco Citywide Performance Metrics’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/1bcd26b4-c457-4ce5-985c-80a797830e7b on 11 February 2022.
--- Dataset description provided by original source is as follows ---
A. SUMMARY This data set reports key performance metrics for departments and programs in the City and County of San Francisco.
B. HOW THE DATASET IS CREATED City departments report updates about their key metrics to the Controller’s Office. The Controller's Office uses an online application to collect and organize this data. Departments update most metrics once or twice each year. Some metrics may not display data for every year.
C. UPDATE PROCESS Most metrics update twice each year. Updates with results for the first 6 months of each fiscal year are published in the spring, typically between April and May. Updates with results for each full fiscal year are published in the fall, typically in November.
D. HOW TO USE THIS DATASET Each row represents one metric and one fiscal year for a department, with multiple values for each fiscal year. Some metrics do not include values for all fields or fiscal years. Some results for the latest fiscal year are unavailable because of known lags in reporting. Users should review any data notes reported for each row for guidance about interpreting values. All values are reported as numbers without formatting, but the column [Measure Data Type] describes the intended format. For example, a value appearing as “0.50” with [Measure Data Type] reported as “Percent” should be displayed as “50%”.
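A minimal sketch of applying that formatting rule in Python; the file name and the 'Value' column name are assumptions, while [Measure Data Type] is taken from the description above.

import pandas as pd

# Illustrative file name; 'Value' is an assumed column name for the reported number.
df = pd.read_csv('citywide_performance_metrics.csv')

def format_value(row):
    # Values are stored unformatted; render percentages such as 0.50 as "50%".
    if row['Measure Data Type'] == 'Percent':
        return f"{float(row['Value']) * 100:g}%"
    return str(row['Value'])

df['Formatted Value'] = df.apply(format_value, axis=1)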
--- Original source retains full ownership of the source dataset ---
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is a substantial collection of over 241,000 English-language comments, gathered from various online platforms. Each comment within the dataset has been carefully annotated with a sentiment label: 0 for negative sentiment, 1 for neutral, and 2 for positive. The primary aim of this dataset is to facilitate the training and evaluation of multi-class sentiment analysis models, designed to work effectively with real-world text data. The dataset has undergone a preprocessing stage, ensuring comments are in lowercase, and are cleaned of punctuation, URLs, numbers, and stopwords, making it readily usable for Natural Language Processing (NLP) pipelines.
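A minimal sketch of the kind of preprocessing described above; the stopword list here is a small illustrative subset, whereas a full list (e.g. from NLTK or spaCy) would normally be used:

import re
import string

STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'is', 'are', 'to', 'of', 'in', 'it'}

def preprocess(comment: str) -> str:
    # Lowercase, then strip URLs, punctuation, numbers and stopwords.
    text = comment.lower()
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', ' ', text)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return ' '.join(tokens)

print(preprocess('Check https://example.com, it is 100% AMAZING!!!'))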
The dataset comprises over 241,000 records. While the specific file format is not detailed, such datasets are typically provided in a tabular format, often as a CSV file. It is structured with two distinct columns (the comment text and its sentiment label), suitable for direct integration into machine learning workflows.
This dataset is ideally suited for a variety of applications and use cases, including:
* Training sentiment classifiers utilising advanced models such as LSTM, BiLSTM, CNN, BERT, or RoBERTa.
* Evaluating the efficacy of different preprocessing and tokenisation strategies for text data.
* Benchmarking NLP models on multi-class classification tasks to assess their performance.
* Supporting educational projects and research initiatives in the fields of opinion mining or text classification.
* Fine-tuning transformer models on a large and diverse collection of sentiment-annotated text.
The dataset's coverage is global, comprising English-language comments. It focuses on general user-generated text content without specific demographic notes. The dataset is listed with a version of 1.0.
CC0
This dataset is suitable for individuals and organisations involved in data science and analytics. Intended users include:
* Data Scientists and Machine Learning Engineers for developing and deploying sentiment analysis models.
* Researchers and Academics for studies in NLP, text classification, and opinion mining.
* Students undertaking educational projects in artificial intelligence and machine learning.
Original Data Source: Sentiment Analysis Dataset
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains responses to a yearly panel survey among entrepreneurs in Baden-Württemberg. Based on the DEEM. research project's collected data (see the DEEM. project's website for more information), we survey founders to track the development of startups in our region and to assess the quality and performance of the local Entrepreneurial Ecosystem (see Empirical entrepreneurial ecosystem research: A guide to creating multilevel datasets for more information on this multilevel dataset). Surveys are sent out to all founders of currently active startups. Surveys were made available in German and English, with respondents being able to choose their preferred language at the start of the survey. For any questions about this survey or the underlying research project, please contact us.

Aims

Research
Integrating data on founders, firms, regional contexts and socioeconomic indicators, this data enables deeper insights into patterns and dynamics across different levels of Entrepreneurial Ecosystems (EEs) - insights often missed in traditional single-source and cross-sectional data studies. As such, this data contributes to the understanding of EEs as multilevel phenomena crucial for understanding and promoting productive entrepreneurship and economic development.

Respondents
We aim for a full population survey every year, instead of drawing samples. This means that all startups with an identifiable means of contact are contacted, with every potential respondent receiving a personalized survey link. Response rates typically vary between 10-15%. To increase response rates, the following approach is used: the survey is left open for a period of two months for founders to answer at their own pace, with periodic reminders sent. While the survey is designed as a panel to track founders' perceptions over time, we cannot guarantee that founders participate in more than one wave. As such, this dataset can be more accurately viewed as a "macro-panel" on the Entrepreneurial Ecosystem of BW.

Usage
This repository is structured as follows: the global codebook contains information on the broad concepts addressed in each survey wave, as well as the question batteries asked to address these concepts. As such it serves as a broad overview for researchers, to understand whether the data suits their research interests, and whether the relevant questions were asked in multiple years (i.e. panel analyses are possible), or whether they were included as one-off batteries. It is only available in English. The folders include the responses obtained for each survey year, as well as a wave-specific codebook with more detailed information. In contrast to the global codebook, these codebooks contain the questions and response options in both English and German, as well as meta-information about question filters and sub-groups if applicable. Additionally, for each item, basic summary statistics (number of responses per category, number of non-responses) are reported.

Data for each survey wave is made available in .csv format (Comma-Separated Values) with a header row. The columns are separated via semicolons (";"). This has been done to avoid conflicts, as some text responses and system variables included commas. Please consider this when loading and using the data with the analysis software of your choice. Should any issues arise in downloading, opening or using this data, please contact us for help.
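Because the files are semicolon-separated, the delimiter has to be set explicitly when loading them; a minimal pandas sketch (the file name is illustrative):

import pandas as pd

# Responses are provided as semicolon-separated CSV files with a header row.
wave = pd.read_csv('responses_wave_2023.csv', sep=';')
print(wave.shape)
print(list(wave.columns)[:10])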
This is a CloudSat-collocated subset of the original product OMSO2, for the purposes of the A-Train mission. The goal of the subset is to select and return OMI data that are within +/-100 km across the CloudSat track. The resultant OMI subset swath is sought to be about 200 km cross-track of CloudSat. Even though collocated with CloudSat, this subset can serve many other A-Train applications. (The shortname for this CloudSat-collocated subset of the original OMSO2 product is OMSO2_CPR_V003.)

This document describes the original OMI SO2 product (OMSO2) produced from global mode UV measurements of the Ozone Monitoring Instrument (OMI). OMI was launched on July 15, 2004 on the EOS Aura satellite, which is in a sun-synchronous ascending polar orbit with 1:45pm local equator crossing time. The data collection started on August 17, 2004 (orbit 482) and continues to this day with only minor data gaps. The minimum SO2 mass detectable by OMI is about two orders of magnitude smaller than the detection threshold of the legacy Total Ozone Mapping Spectrometer (TOMS) SO2 data (1978-2005) [Krueger et al 1995]. This is due to the smaller OMI footprint and the use of wavelengths better optimized for separating O3 from SO2.

The product file, called a data granule, covers the sunlit portion of the orbit with an approximately 2600 km wide swath containing 60 pixels per viewing line. During normal operations, 14 or 15 granules are produced daily, providing fully contiguous coverage of the globe. Currently, OMSO2 products are not produced when OMI goes into the "zoom mode" for one day every 452 orbits (~32 days).

For each OMI pixel we provide 4 different estimates of the column density of SO2 in Dobson Units (1 DU = 2.69x10^16 molecules/cm2), obtained by making different assumptions about the vertical distribution of the SO2. However, it is important to note that in most cases the precise vertical distribution of SO2 is unimportant. The users can use either the SO2 plume height, or the center of mass altitude (CMA) derived from the SO2 vertical distribution, to interpolate between the 4 values:
1) Planetary Boundary Layer (PBL) SO2 column (ColumnAmountSO2_PBL), corresponding to a CMA of 0.9 km.
2) Lower tropospheric SO2 column (ColumnAmountSO2_TRL), corresponding to a CMA of 2.5 km.
3) Middle tropospheric SO2 column (ColumnAmountSO2_TRM), usually produced by volcanic degassing, corresponding to a CMA of 7.5 km.
4) Upper tropospheric and stratospheric SO2 column (ColumnAmountSO2_STL), usually produced by explosive volcanic eruption, corresponding to a CMA of 17 km.

The accuracy and precision of the derived SO2 columns vary significantly with the SO2 CMA and column amount, observational geometry, and slant column ozone. OMI becomes more sensitive to SO2 above clouds and snow/ice, and less sensitive to SO2 below clouds. Preliminary error estimates are discussed below (see Data Quality Assessment).

OMSO2 files are stored in EOS Hierarchical Data Format (HDF-EOS5). Each file contains data from the daylit portion of an orbit (53 minutes). There are approximately 14 orbits per day. The maximum file size for the OMSO2 data product is about 9 Mbytes.
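A minimal sketch of inspecting one granule with h5py, without assuming the exact internal group layout (the file name is illustrative):

import h5py

# Walk the HDF-EOS5 file and report the SO2 column datasets described above.
with h5py.File('OMI-Aura_L2-OMSO2_sample.he5', 'r') as f:
    def report(name, obj):
        if isinstance(obj, h5py.Dataset) and 'ColumnAmountSO2' in name:
            print(name, obj.shape)
    f.visititems(report)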
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The published data were the basis for the analysis and publication of an article on human-AI collaboration in the workplace. A human-AI collaboration experiment was conducted, in which participants were asked to perform several tasks typical of knowledge-based work. Participants were divided into two groups: cooperating with generative AI and working independently.

There are two files published in this dataset, containing data for two separate parts of the study:
1. Collaboration experiment data
2. Textual data

Ad. 1 This dataset includes the results of pre-test (columns T-AM) and post-test surveys (columns AN-AV), participants' responses to tasks (columns AW-AZ), and the assessment of these tasks by independent judges (columns D-S). There are 94 observations and 51 variables.

Ad. 2 This dataset includes logs of participants' conversations with a bot from the group cooperating with AI. Column A is the message sent by user or bot, column B specifies who sent the message, and column C is the participant's anonymous username.
https://creativecommons.org/publicdomain/zero/1.0/
By ai2_arc (From Huggingface) [source]
The ai2_arc dataset, also known as the A Challenge Dataset for Advanced Question-Answering in Grade-School Level Science, is a comprehensive and valuable resource created to facilitate research in advanced question-answering. This dataset consists of a collection of 7,787 genuine grade-school level science questions presented in multiple-choice format.
The primary objective behind assembling this dataset was to provide researchers with a powerful tool to explore and develop question-answering models capable of tackling complex scientific inquiries typically encountered at a grade-school level. The questions within this dataset are carefully crafted to test the knowledge and understanding of various scientific concepts in an engaging manner.
The ai2_arc dataset is further divided into two primary sets: the Challenge Set and the Easy Set. Each set contains numerous highly curated science questions that cover a wide range of topics commonly taught at a grade-school level. These questions are designed specifically for advanced question-answering research purposes, offering an opportunity for model evaluation, comparison, and improvement.
In terms of data structure, the ai2_arc dataset features several columns providing vital information about each question. These include columns such as question, which contains the text of the actual question being asked; choices, which presents the multiple-choice options available for each question; and answerKey, which indicates the correct answer corresponding to each specific question.
Researchers can utilize this comprehensive dataset not only for developing advanced algorithms but also for training machine learning models that exhibit sophisticated cognitive capabilities when it comes to comprehending scientific queries from a grade-school perspective. Moreover, by leveraging these meticulously curated questions, researchers can analyze performance metrics such as accuracy or examine biases within their models' decision-making processes.
In conclusion, the ai2_arc dataset serves as an invaluable resource for anyone involved in advanced question-answering research within grade-school level science education. With its extensive collection of genuine multiple-choice science questions spanning various difficulty levels, researchers can delve into the intricate nuances of scientific knowledge acquisition, processing, and reasoning, ultimately unlocking novel insights and innovations in the field.
- Developing advanced question-answering models: The ai2_arc dataset provides a valuable resource for training and evaluating advanced question-answering models. Researchers can use this dataset to develop and test algorithms that can accurately answer grade-school level science questions.
- Evaluating natural language processing (NLP) models: NLP models that aim to understand and generate human-like responses can be evaluated using this dataset. The multiple-choice format of the questions allows for objective evaluation of the model's ability to comprehend and provide correct answers.
- Assessing human-level performance: The dataset can be used as a benchmark to measure the performance of human participants in answering grade-school level science questions. By comparing the accuracy of humans with that of AI systems, researchers can gain insights into the strengths and weaknesses of both approaches
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: ARC-Challenge_test.csv

| Column name | Description |
|:------------|:------------|
| question | The text content of each question being asked. (Text) |
| choices | A list of multiple-choice options associated with each question. (List of Text) |
| answerKey | The correct answer option (choice) for a particular question. (Text) |
File: ARC-Easy_test.csv | Column name | Description ...
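A minimal sketch of loading the Challenge test split with pandas; how the choices column is serialised in the CSV is an assumption (a Python/JSON-style literal is assumed here):

import ast
import pandas as pd

arc = pd.read_csv('ARC-Challenge_test.csv')
# Parse the serialised multiple-choice options back into Python objects.
arc['choices'] = arc['choices'].apply(ast.literal_eval)

row = arc.iloc[0]
print(row['question'])
print(row['choices'])
print('Correct answer:', row['answerKey'])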
https://object-store.os-api.cci2.ecmwf.int:443/bopen-cds2-stable-catalogue/licences/ghg-cci/ghg-cci_0911d58e24365e15589377902e562c6e9231290f75b14ddc3c7cb5fd09a265af.pdf
This dataset provides observations of atmospheric carbon dioxide (CO₂) amounts obtained from observations collected by several current and historical satellite instruments. Carbon dioxide is a naturally occurring Greenhouse Gas (GHG), but one whose abundance has been increased substantially above its pre-industrial value of some 280 ppm by human activities, primarily because of emissions from combustion of fossil fuels, deforestation and other land-use change. The annual cycle (especially in the northern hemisphere) is primarily due to seasonal uptake and release of atmospheric CO2 by terrestrial vegetation. Atmospheric carbon dioxide abundance is indirectly observed by various satellite instruments. These instruments measure spectrally resolved near-infrared and/or infrared radiation reflected or emitted by the Earth and its atmosphere. In the measured signal, molecular absorption signatures from carbon dioxide and other constituent gasses can be identified. It is through analysis of those absorption lines in these radiance observations that the averaged carbon dioxide abundance in the sampled atmospheric column can be determined. The software used to analyse the absorption lines and determine the carbon dioxide concentration in the sampled atmospheric column is referred to as the retrieval algorithm. For this dataset, carbon dioxide abundances have been determined by applying several algorithms to different satellite instruments. Typically, different algorithms have different strengths and weaknesses and therefore, which product to use for a given application typically depends on the application. The data set consists of 2 types of products:
• column-averaged mixing ratios of CO2, denoted XCO2
• mid-tropospheric CO2 columns
The XCO2 products have been retrieved from SCIAMACHY/ENVISAT, TANSO-FTS/GOSAT, TANSO-FTS2/GOSAT2 and OCO-2. The mid-tropospheric CO2 product has been retrieved from the IASI instruments on-board the Metop satellite series and from AIRS. The XCO2 products are available as Level 2 (L2) products (satellite orbit tracks) and as Level 3 (L3) product (gridded). The L2 products are available as individual sensor products (SCIAMACHY: BESD and WFMD algorithms; GOSAT: OCFP and SRFP algorithms) and as a multi-sensor merged product (EMMA algorithm). The L3 XCO2 product is provided in OBS4MIPS format. The IASI and AIRS products are available as L2 products generated with the NLIS algorithm. This data set is updated on a yearly basis, with each update cycle adding (if required) a new data version for the entire period, up to one year behind real time. This dataset is produced on behalf of C3S with the exception of the SCIAMACHY and AIRS L2 products that were generated in the framework of the GHG-CCI project of the European Space Agency (ESA) Climate Change Initiative (CCI).
AASG Wells Data for the EGS Test Site Planning and Analysis Task
Temperature measurement data obtained from boreholes for the Association of American State Geologists (AASG) geothermal data project. Typically bottomhole temperatures are recorded from log headers, and this information is provided through a borehole temperature observation service for each state. Service includes header records, well logs, temperature measurements, and other information for each borehole. Information presented in Geothermal Prospector was derived from data aggregated from the borehole temperature observations for all states. For each observation, the given well location was recorded and the best available well identifier (name), temperature and depth were chosen. The "Well Name Source," "Temp. Type" and "Depth Type" attributes indicate the field used from the original service. This data was then cleaned and converted to consistent units. The accuracy of the observation's location, name, temperature or depth was not assessed beyond that originally provided by the service.
• Temperature:
  • CorrectedTemperature – best
  • MeasuredTemperature – next best
• Depth:
  • DepthOfMeasurement – best
  • TrueVerticalDepth – next best
  • DrillerTotalDepth – last option
• Well Name/Identifier:
  • APINo – best
  • WellName – next best
  • ObservationURI – last option
The column headers are as follows:
• gid = internal unique ID
• src_state = the state from which the well was downloaded (note: the low temperature wells in Idaho are coded as “ID_LowTemp”, while all other wells are simply the two character state abbreviation)
• source_url = the url for the source WFS service or Excel file
• temp_c = “best” temperature in Celsius
• temp_type = indicates whether temp_c comes from the corrected or measured temperature header column in the source document
• depth_m = “best” depth in meters
• depth_type = indicates whether depth_m comes from the measured, true vertical, or driller total depth header column in the source document
• well_name = “best” well name or ID
• name_src = indicates whether well_name came from apino, wellname, or observationuri header column in the source document
• lat_wgs84 = latitude in wgs84
• lon_wgs84 = longitude in wgs84
• state = state in which the point is located
• county = county in which the point is located
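A minimal sketch of using these columns with pandas (the file name is illustrative):

import pandas as pd

wells = pd.read_csv('aasg_wells.csv')

# Keep records with both a usable temperature and depth, then pick out
# wells at least 1 km deep with temperatures of 90 °C or more.
usable = wells.dropna(subset=['temp_c', 'depth_m'])
deep_hot = usable[(usable['depth_m'] >= 1000) & (usable['temp_c'] >= 90)]
print(len(deep_hot), 'wells at >= 1 km depth with >= 90 °C')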
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset offers a powerful synthetic English-ASL gloss parallel corpus that was generated in 2012, providing an exciting opportunity to bridge the cultural divide between English and American Sign Language. By exploring this cross-cultural language interoperability, it aims to connect linguistic communities and bring together aspects of communication often seen as separated. The data supports innovative approaches to machine translation models and helps to uncover further insights into bridging linguistic divides.
The dataset consists of two primary columns:
The dataset is typically provided in a CSV file format, specifically referenced as train.csv. It comprises two columns: gloss and text. The gloss column contains 81,123 unique values, while the text column contains 81,016 unique values. This indicates the dataset consists of approximately 81,123 records.
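A minimal sketch of loading the file and checking the figures quoted above:

import pandas as pd

pairs = pd.read_csv('train.csv')
print(pairs[['gloss', 'text']].head())
print('Unique glosses:', pairs['gloss'].nunique())
print('Unique texts:', pairs['text'].nunique())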
This dataset can be used for a variety of applications and use cases, including:
The dataset focuses on the linguistic relationship between English and American Sign Language. While specific demographic details are not provided, its general availability is noted as global. The data was generated in 2012, offering a snapshot from that time.
CC0
This dataset is ideal for:
Original Data Source: AslgPc12 (English-ASL Gloss Parallel Corpus 2012)
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is a classic and very widely used dataset in machine learning and statistics, often serving as a first dataset for classification problems. Introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems," it is a foundational resource for learning classification algorithms.
Overview:
The dataset contains measurements for 150 samples of iris flowers. Each sample belongs to one of three species of iris: Iris setosa, Iris versicolor, and Iris virginica.
For each flower, four features were measured: sepal length, sepal width, petal length, and petal width, all in centimetres.
The goal is typically to build a model that can classify iris flowers into their correct species based on these four features.
File Structure:
The dataset is usually provided as a single CSV (Comma Separated Values) file, often named iris.csv or similar. This file typically contains one column for each of the four measurements listed above plus a column for the species label.
Content of the Data:
The dataset contains an equal number of samples (50) for each of the three iris species. The measurements of the sepal and petal dimensions vary between the species, allowing for their differentiation using machine learning models.
How to Use This Dataset:
Load the iris.csv file into your analysis environment of choice and train a classifier on the four measurement columns, as sketched below.
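The column names used in this sketch (the four measurements plus a species label) are assumptions, since they vary between distributions of the file:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = pd.read_csv('iris.csv')
X = iris.drop(columns=['species'])   # the four measurement columns
y = iris['species']                  # the class label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print('Test accuracy:', clf.score(X_test, y_test))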
Citation:
When using the Iris dataset, it is common to cite Ronald Fisher's original work:
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188.
Data Contribution:
Thank you for providing this classic and fundamental dataset to the Kaggle community. The Iris dataset remains an invaluable resource for both beginners learning the basics of classification and experienced practitioners testing new algorithms. Its simplicity and clear class separation make it an ideal starting point for many data science projects.
If you find this dataset description helpful and the dataset itself useful for your learning or projects, please consider giving it an upvote after downloading. Your appreciation is valuable!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset offers a detailed look at the biomechanics of boxing punches, specifically the jab and rear cross, by capturing the acceleration of different parts of the arm and the force generated during these movements. The integration of IMU and force plate data provides a rich source of information for analyzing the efficiency, power, and technique of athletes, offering valuable insights for coaches, researchers, and athletes in the field of sports science and physical culture.
This dataset comprises a collection of Excel files containing measurements of right hand pads from various participants. Each file is named to reflect the participant's identifier followed by the specific measurement context (e.g., "participant1_right hand pad.xlsx"). The original dataset included participant surnames, but these have been replaced with unique numerical identifiers (e.g., "participant1", "participant2", etc.) to ensure anonymity. The dataset is intended for use in physical culture and sports science research, providing valuable data for studies on hand measurements and their implications in sports and physical activities.
Files within this dataset follow a standardized naming convention: <participant_id>_<measurement>.xlsx. The <participant_id> part is a unique numerical identifier assigned to each participant (e.g., "participant1"), and <measurement> describes the specific measurement focus of the file (e.g., "right hand pad").
Time: The timestamp or duration associated with each measurement session, indicating when each set of measurements was taken, typically essential for analyzing movement or force over time.
1x, 1y, 1z: Acceleration data (in milli-g) for the IMU placed on the fist. These columns capture the three-dimensional acceleration of the fist during boxing techniques, with 'x', 'y', and 'z' representing the acceleration along the horizontal, vertical, and depth axes, respectively. This data is crucial for understanding the speed and direction of the punch.
2x, 2y, 2z: Acceleration data (in milli-g) for the IMU on the forearm. Similar to the fist data, these columns provide insights into the forearm's movement dynamics during the execution of boxing techniques, offering a comprehensive view of the arm's acceleration.
3x, 3y, 3z: Acceleration data (in milli-g) for the IMU placed on the upper arm. These measurements complement the fist and forearm data, providing a complete picture of the arm's acceleration and movement patterns during different boxing punches.
fx, fy, fz: Force measurements (in Newtons) from the force plate. These columns represent the force exerted in the x (horizontal), y (vertical), and z (depth or forward/backward) directions. Force plate data is essential for analyzing the power and effectiveness of boxing techniques, as well as the athlete's balance and stability during the punch execution.
Files are related to code on GitHub: https://github.com/Dareczin/boxing_biomechanics
Based on this code, data is saved in two folders: the original data capture (5 strikes in one measurement) and files produced after processing, in which each event (strike) is extracted as a separate file.
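A minimal sketch of reading one participant's file and summarising the fist acceleration and plate force, assuming the column names listed above (reading .xlsx files with pandas requires the openpyxl package):

import numpy as np
import pandas as pd

df = pd.read_excel('participant1_right hand pad.xlsx')

# Resultant acceleration of the fist IMU, converted from milli-g to g.
fist_accel_g = np.sqrt(df['1x']**2 + df['1y']**2 + df['1z']**2) / 1000.0

# Resultant force measured by the force plate, in newtons.
force_n = np.sqrt(df['fx']**2 + df['fy']**2 + df['fz']**2)

print('Peak fist acceleration [g]:', fist_accel_g.max())
print('Peak force [N]:', force_n.max())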
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset is related to 146 typically developing infants who underwent baseline electrophysiological (EEG) data recording at 6 (T6) and 12 (T12) months of age. The recordings were made using a dense-array EGI system (Geodesic EEG System (GES) 300 or 400, Electrical Geodesics, Inc., Eugene, Oregon, USA) equipped with 60/64-electrode or 128-electrode caps (HydroCel Geodesic Sensor net).
Whole-brain functional connectivity (FC) metrics were extracted, for multiple frequency bands with the aim to evaluate brain maturation in the first year of life in terms of EEG functional connectivity. In addition, Bayley test was administered at 24 months of age to explore possible relation between brain network connectivity and cognitive functions.
The dataset includes subjects’ sociodemographic and individual factors (such as age, sex, socioeconomic status, gestational week and birth weight); functional connectivity metrics (such as the magnitude-squared coherence index, phase lag index (PLI), and parameters characterizing the minimum spanning tree built from the PLI index) computed in the delta (2-4 Hz), theta (4-6 Hz), low-alpha (6-9 Hz), high-alpha (9-13 Hz), beta (13-30 Hz) and gamma (30-45 Hz) frequency bands; and the Bayley test raw scores, assessed at 24 months of age.
In particular, each row in the database corresponds to a subject and each column to a different variable.
Column A, Subject code;
Column B, Time point: 1 = data related to the EEG recording performed at six months of age; 2 = data related to the EEG recording performed at twelve months of age;
Column C, Sex: 0= males; 1=females;
Column D, Age (expressed in days) at T6;
Column E, Age (expressed in days) at T12;
Column F, Family socio-economic status;
Column G, Gestational age expressed in weeks;
Column H, Birth weight expressed in grams;
Columns I and J, Bayley Cognitive Composite Score and Griffiths developmental quotient, both assessed at 6 months of age;
Column K and L, Number of electrodes of the used electrode-caps for T6 and T12, respectively;
Columns from M to BM, FC metrics for all frequency bands;
Columns from BO to BQ, raw cognitive, receptive and expressive Bayley test scores assessed at 24 months of age;
Column BR, Composite language metric derived from the expressive and receptive Bayley scores.
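A minimal sketch of selecting a subset of these columns with pandas; the file name, and the assumption that the database is distributed as a spreadsheet, are illustrative only:

import pandas as pd

# Select subject code, time point and the Bayley-related columns by their letters.
df = pd.read_excel('eeg_fc_dataset.xlsx', usecols='A:B,BO:BR')

# Time point 1 = recording at 6 months (T6), 2 = recording at 12 months (T12).
t6 = df[df.iloc[:, 1] == 1]
t12 = df[df.iloc[:, 1] == 2]
print(len(t6), 'rows at T6;', len(t12), 'rows at T12')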
If you use this dataset please cite the following manuscript: Falivene, A.; Cantiani, C.; Dondena, C.; Riboldi, E.M.; Riva, V.; Piazza, C. EEG Functional Connectivity Analysis for the Study of the Brain Maturation in the First Year of Life. Sensors 2024, 24, 4979. All details about subjects, data acquisition and signal processing pipeline are described in the manuscript.
ACTIVE APPRENTICESHIP PROGRAM PARTICIPANTS, APPRENTICESHIP SPONSORS & APPROVED TRAINING DELIVERY AGENTS (TDA) BY TRADE - PROVINCE OF ONTARIO
Demographic data for those enrolled in Ontario apprenticeship programs. The data covers 158 trades.
For each trade, the data includes:
• sector
• Red Seal (yes/no)
• total number of participants
• gender
• age cohort
• number of approved sponsors
• number of public training delivery agents
• number of private training delivery agents
A program participant is an individual who is active in an apprenticeship training program for a specific trade. To be included in the count, program participants must have a training agreement with a sponsor that is currently registered with the Ministry or was in a registered status within the last 12 months. Program participants currently without registered training agreements are usually between apprenticeship jobs or attending classroom training.
A sponsor is responsible for an apprentice’s on-the-job training. Sponsors are typically employers, unions or local apprenticeship committees. Because one sponsor may sponsor apprentices in multiple trades, this column cannot be summed to arrive at the total number of unique apprenticeship sponsors.
Training delivery agents (TDAs) are approved by the Ministry to deliver the classroom training component of apprenticeship programs. Because one TDA may deliver classroom training for multiple trades, this column cannot be summed to arrive at the total number of unique TDAs.
Explanation of Dataset Column Headings:
TRADE SECTOR: Trades are grouped into four sectors: Construction, Industrial, Motive Power, and Service.
REDSEAL = Red Seal certification in a trade means the holder can work in any province that participates in the Interprovincial Red Seal Program without further assessment or testing.
TOTAL PARTICIPANTS = Total number of apprenticeship program participants in the trade; where the number of total participants is less than 20, the actual number is not displayed to protect the privacy of individual program participants.
MALE/FEMALE: Where the number for one of the genders is less than 20, actual numbers are not displayed to protect the privacy of individual program participants. Instead, the information will display as either “< 20” or “> 20”.
Column headings for the age cohort for participants:
• AGE UNDER 20 = Under 20 years of age
• AGE 20-29 = Between 20 and 29
• AGE 30-44 = Between 30 and 44
• AGE 45-54 = Between 45 and 54
• AGE 55PLUS = Over 55 years of age
PUBLIC TDAS = Training delivery agents that are Colleges of Applied Arts and Technology; Institutes of Technology and Advanced Learning
PRIVATE TDAS = Training delivery agents funded privately, including union-run training centres, private career colleges and employer-run training centres. * Unique Counts cannot be derived by summing these columns. Unique totals are provided on the Grand Total Summary Line.
Notes:
1. When appropriate, counts of remaining age ranges in the same trade, where providing those values would allow the suppressed value to be calculated, were also suppressed for privacy reasons and indicated with a ">20".
2. As one employer may sponsor apprentices in multiple trades, this column cannot be summed to arrive at the total number of unique apprenticeship sponsors.
3. Training Delivery Agents (TDAs) are approved by the Ministry to deliver the in-class component of apprenticeship programs.
4. As one TDA may deliver classroom training for multiple trades, this column cannot be summed to arrive at the total number of unique TDAs.
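When loading the data programmatically, the suppressed counts ("< 20" / "> 20") need to be handled before any column is summed; a minimal pandas sketch (the file name is illustrative, and the exact column heading should be checked against the file):

import pandas as pd

df = pd.read_csv('apprenticeship_by_trade.csv', dtype=str)

def to_count(value):
    # Treat suppressed values such as "< 20" or "> 20" as missing.
    if isinstance(value, str) and value.strip() in ('< 20', '> 20', '<20', '>20'):
        return None
    return value

df['TOTAL PARTICIPANTS'] = pd.to_numeric(df['TOTAL PARTICIPANTS'].map(to_count),
                                         errors='coerce')
print(df['TOTAL PARTICIPANTS'].sum())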
The Sentinel-5P TROPOMI Near Real Time (NRT) Tropospheric Ozone Column V2 (S5P_L2_O3_TCL_NRT) at GES DISC is the near real time version of the offline S5P_L2_O3_TCL product. These data are typically available within three hours of measurement, as required by the Land Atmosphere NRT Capability Earth Observing System (LANCE). They are intended for rapid turnaround assessment and are only archived for up to ten days. Users who require a longer data record, or who wish to conduct rigorous analysis, should use the offline version of this product, S5P_L2_O3_TCL.

The Copernicus Sentinel-5 Precursor (Sentinel-5P or S5P) satellite mission is one of the European Space Agency's (ESA) new mission family - Sentinels, and it is a joint initiative between the Kingdom of the Netherlands and the ESA. The sole payload on Sentinel-5P is the TROPOspheric Monitoring Instrument (TROPOMI), a nadir-viewing, 108 degree field-of-view, push-broom grating hyperspectral spectrometer covering the ultraviolet-visible (UV-VIS, 270 nm to 495 nm), near infrared (NIR, 675 nm to 775 nm), and shortwave infrared (SWIR, 2305 nm to 2385 nm) wavelength ranges. Sentinel-5P is the first of the Atmospheric Composition Sentinels and is expected to provide measurements of ozone, NO2, SO2, CH4, CO, formaldehyde, aerosols and cloud at high spatial, temporal and spectral resolutions.

Copernicus Sentinel-5P tropospheric ozone data products are retrieved by the convective-cloud-differential (CCD) algorithm to derive the tropospheric ozone columns, and by the cloud slicing algorithm (CSA) to derive mean upper tropospheric ozone volume mixing ratios above the clouds. The S5P_TROPOZ_CCD algorithm uses TROPOMI Level-2 ozone column measurements and the cloud parameters provided by S5P_CLOUD_OCRA and S5P_CLOUD_ROCINN; from these, the average values of the tropospheric ozone columns below 270 hPa can be determined. The S5P_TROPOZ_CSA algorithm uses the correlation between cloud top pressure and the ozone column above the cloud. The retrieval depends on the number of measurements with high cloud cover. The products are restricted to the tropical region (-20 degrees to 20 degrees of latitude).

The main outputs of the Copernicus S5P/TROPOMI tropospheric ozone product include the tropospheric ozone column and corresponding errors, upper tropospheric ozone and corresponding errors, stratospheric ozone column and corresponding errors, and the retrieval quality flags. The data are stored in an enhanced netCDF-4 format, in individual files (granules) that each contain one orbit of information. Complete files are about 20 MB.
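A minimal sketch of listing the variables in one granule with the netCDF4 library, without assuming a particular internal group layout (the file name is illustrative):

from netCDF4 import Dataset

def walk(group, prefix=''):
    # Recursively print every variable in the netCDF-4 group tree.
    for name in group.variables:
        print(prefix + name)
    for child_name, child in group.groups.items():
        walk(child, prefix + child_name + '/')

with Dataset('S5P_NRTI_L2__O3_TCL_sample.nc') as nc:
    walk(nc)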
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT
The issue of diagnosing psychotic diseases, including schizophrenia and bipolar disorder, in particular, the objectification of symptom severity assessment, is still a problem requiring the attention of researchers. Two measures that can be helpful in patient diagnosis are heart rate variability calculated based on electrocardiographic signal and accelerometer mobility data. The following dataset contains data from 30 psychiatric ward patients having schizophrenia or bipolar disorder and 30 healthy persons. The duration of the measurements for individuals was usually between 1.5 and 2 hours. R-R intervals necessary for heart rate variability calculation were collected simultaneously with accelerometer data using a wearable Polar H10 device. The Positive and Negative Syndrome Scale (PANSS) test was performed for each patient participating in the experiment, and its results were attached to the dataset. Furthermore, the code for loading and preprocessing data, as well as for statistical analysis, was included on the corresponding GitHub repository.
BACKGROUND
Heart rate variability (HRV), calculated based on electrocardiographic (ECG) recordings of R-R intervals stemming from the heart's electrical activity, may be used as a biomarker of mental illnesses, including schizophrenia and bipolar disorder (BD) [Benjamin et al]. The variations of R-R interval values correspond to the heart's autonomic regulation changes [Berntson et al, Stogios et al]. Moreover, the HRV measure reflects the activity of the sympathetic and parasympathetic parts of the autonomous nervous system (ANS) [Task Force of the European Society of Cardiology the North American Society of Pacing Electrophysiology, Matusik et al]. Patients with psychotic mental disorders show a tendency for a change in the centrally regulated ANS balance in the direction of less dynamic changes in the ANS activity in response to different environmental conditions [Stogios et al]. Larger sympathetic activity relative to the parasympathetic one leads to lower HRV, while, on the other hand, higher parasympathetic activity translates to higher HRV. This loss of dynamic response may be an indicator of mental health. Additional benefits may come from measuring the daily activity of patients using accelerometry. This may be used to register periods of physical activity and inactivity or withdrawal for further correlation with HRV values recorded at the same time.
EXPERIMENTS
In our experiment, the participants were 30 psychiatric ward patients with schizophrenia or BD and 30 healthy people. All measurements were performed using a Polar H10 wearable device. The sensor collects ECG recordings and accelerometer data and, additionally, performs detection of R-wave peaks. Participants had to wear the sensor for a given time, usually between 1.5 and 2 hours; the shortest recording was 70 minutes. During this time, beginning a few minutes after the start of the measurement, participants could perform any activity. Participants were encouraged to undertake physical activity and, more specifically, to take a walk. Because the patients were in the medical ward, they were instructed to take a walk in the corridors at the beginning of the experiment and to repeat the walk 30 minutes and 1 hour after the first walk, with the subsequent walks slightly longer (about 3, 5 and 7 minutes, respectively). We did not repeat this instruction or supervise compliance during the experiment, in either the treatment or the control group. Seven persons from the control group did not receive this instruction; their measurements correspond to freely selected activities with rest periods, although at least three of them performed physical activities during this time. Nevertheless, at the start of the experiment, all participants were requested to rest in a sitting position for 5 minutes. Moreover, for each patient, the disease severity was assessed using the PANSS test, and its scores are attached to the dataset.
The data from the sensors were collected using the Polar Sensor Logger application [Happonen]. The extracted measurements were then preprocessed and analyzed using code prepared by the authors of the experiment, which is publicly available in the GitHub repository [Książek et al].
First, we performed manual artifact detection to remove abnormal heartbeats caused by non-sinus beats and technical issues with the device (e.g. temporary disconnections and inappropriate electrode readings). We also performed anomaly detection using a Daubechies wavelet transform. The dataset nevertheless includes the raw data, and the full code necessary to reproduce our anomaly detection approach is available in the repository. Optionally, cubic spline interpolation of the data can also be performed. After that step, rolling windows of a chosen size, with chosen time intervals between them, are created. A statistical analysis is then carried out, e.g. mean HRV calculation using the RMSSD (Root Mean Square of Successive Differences) approach, measurement of the relationship between mean HRV and PANSS scores, mobility coefficient calculation based on the accelerometer data, and verification of the dependencies between HRV and mobility scores.
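For illustration, the following minimal sketch shows how RMSSD can be computed over rolling time windows of R-R intervals. It is not the authors' implementation from the repository [Książek et al]; the window length, step and input format (a pandas Series of R-R intervals in milliseconds indexed by timestamps) are assumptions made for this example.

```python
import numpy as np
import pandas as pd

def rmssd(rr_ms: np.ndarray) -> float:
    """Root Mean Square of Successive Differences of R-R intervals (in ms)."""
    diffs = np.diff(rr_ms)
    return float(np.sqrt(np.mean(diffs ** 2)))

def rolling_rmssd(rr: pd.Series, window: str = "5min", step: str = "1min") -> pd.Series:
    """RMSSD computed over rolling time windows.

    rr: R-R intervals in milliseconds, indexed by a DatetimeIndex.
    window/step: window length and spacing between window starts (assumed values).
    """
    win, stp = pd.Timedelta(window), pd.Timedelta(step)
    start, end = rr.index.min(), rr.index.max()
    values = {}
    t = start
    while t + win <= end:
        chunk = rr.loc[t:t + win]          # all R-R intervals inside the window
        if len(chunk) > 1:
            values[t] = rmssd(chunk.to_numpy())
        t += stp
    return pd.Series(values, name="rmssd_ms")
```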
DATA DESCRIPTION
The structure of the dataset is as follows. One folder, called HRV_anonymized_data, contains the values of R-R intervals together with timestamps for each experiment participant. The data were properly anonymized, i.e. the day of the measurement was removed to prevent identification of individuals. Files concerning patients are named treatment_X.csv, where X is the identification number of the person, while files related to the healthy controls are named control_Y.csv, where Y is the identification number of the person. Furthermore, for visualization purposes, an image of the raw R-R intervals of each participant is provided, named raw_RR_{control,treatment}_N.png, where N is the number of the person from the control/treatment group. The collected data are raw, i.e. before anomaly removal. The code enabling reproduction of the anomaly detection stage and removal of suspicious heartbeats is publicly available in the repository [Książek et al]. The structure of the files containing R-R intervals is as follows:
Phone timestamp | RR-interval [ms] |
12:43:26.538000 | 651 |
12:43:27.189000 | 632 |
12:43:27.821000 | 618 |
12:43:28.439000 | 621 |
12:43:29.060000 | 661 |
... | ... |
The first column contains the timestamp at which the distance between two consecutive R peaks was registered. The corresponding R-R interval, expressed in milliseconds, is given in the second column of the file.
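For orientation, a minimal loading sketch is shown below. It assumes that the CSV files contain the two columns shown above ('Phone timestamp' and 'RR-interval [ms]') and that the export is semicolon-separated; both assumptions should be verified against the actual files and the repository code.

```python
import pandas as pd

def load_rr(path: str) -> pd.Series:
    """Load one HRV_anonymized_data file into a Series of R-R intervals (ms).

    Column names and separator are assumptions based on the table above;
    verify them against the actual CSV files.
    """
    df = pd.read_csv(path, sep=";")  # adjust sep if the files are comma-separated
    df.columns = [c.strip() for c in df.columns]
    ts = pd.to_datetime(df["Phone timestamp"], format="%H:%M:%S.%f")
    rr = pd.Series(df["RR-interval [ms]"].to_numpy(), index=ts, name="rr_ms")
    return rr.sort_index()

# Example (hypothetical path):
# rr = load_rr("HRV_anonymized_data/treatment_1.csv")
```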
The second folder, called accelerometer_anonymized_data, contains the accelerometer data collected at the same time as the R-R intervals. The naming convention is the same as for the R-R interval data: treatment_X.csv and control_Y.csv contain the data from persons in the treatment and control groups, respectively, where X and Y are the identification numbers of the participants. The numbers are exactly the same as for the R-R intervals. The structure of the files with accelerometer recordings is as follows:
Phone timestamp | X [mg] | Y [mg] | Z [mg] |
13:00:17.196000 | -961 | -23 | 182 |
13:00:17.205000 | -965 | -21 | 181 |
13:00:17.215000 | -966 | -22 | 187 |
13:00:17.225000 | -967 | -26 | 193 |
13:00:17.235000 | -965 | -27 | 191 |
... | ... | ... | ... |
The first column contains the timestamp, while the next three columns give the acceleration registered at that moment along the X, Y and Z axes, expressed in milli-g.
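The mobility coefficient used in the analysis is defined in the authors' repository. As a purely illustrative proxy, the sketch below summarizes movement as the standard deviation of the acceleration magnitude per time window; the column names, the separator and the window length are assumptions based on the table above.

```python
import numpy as np
import pandas as pd

def load_acc(path: str) -> pd.DataFrame:
    """Load one accelerometer_anonymized_data file (X, Y, Z in milli-g)."""
    df = pd.read_csv(path, sep=";")  # adjust sep if the files are comma-separated
    df.columns = [c.strip() for c in df.columns]
    df["timestamp"] = pd.to_datetime(df["Phone timestamp"], format="%H:%M:%S.%f")
    return df.set_index("timestamp")[["X [mg]", "Y [mg]", "Z [mg]"]]

def activity_proxy(acc: pd.DataFrame, window: str = "1min") -> pd.Series:
    """Standard deviation of the acceleration magnitude per time window.

    This is only an illustrative mobility proxy, not the mobility
    coefficient defined in the authors' repository.
    """
    magnitude = np.sqrt((acc ** 2).sum(axis=1))
    return magnitude.resample(window).std()
```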
We also attached a file with the PANSS test scores (PANSS.csv) for all patients participating in the measurement. The structure of this file is as follows:
no_of_person | PANSS_P | PANSS_N | PANSS_G | PANSS_total |
1 | 8 | 13 | 22 | 43 |
2 | 11 | 7 | 18 | 36 |
3 | 14 | 30 | 44 | 88 |
4 | 18 | 13 | 27 | 58 |
... | ... | ... | ... | ... |
The first column contains the identification number of the patient, the next three columns contain the PANSS scores for positive, negative and general symptoms, respectively, and the last column contains the total PANSS score.
USAGE NOTES
All the files necessary to run the HRV and/or accelerometer data analysis are available in the GitHub repository [Książek et al]. HRV data loading, preprocessing (i.e. anomaly detection and removal) and the subsequent statistical analysis can all be reproduced with the scripts provided there.
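As a rough illustration of how the PANSS scores can be combined with the HRV recordings, the sketch below computes a Spearman correlation between a per-patient overall RMSSD and the total PANSS score. It skips anomaly removal and windowing, assumes the file paths and separators described above, and is not the analysis implemented in the repository.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def mean_rmssd(path: str) -> float:
    """Overall RMSSD of one R-R interval file (columns as in the tables above)."""
    df = pd.read_csv(path, sep=";")  # adjust the separator if needed
    df.columns = [c.strip() for c in df.columns]
    rr = df["RR-interval [ms]"].to_numpy(dtype=float)
    return float(np.sqrt(np.mean(np.diff(rr) ** 2)))

# PANSS.csv columns: no_of_person, PANSS_P, PANSS_N, PANSS_G, PANSS_total
panss = pd.read_csv("PANSS.csv")

hrv = [mean_rmssd(f"HRV_anonymized_data/treatment_{int(n)}.csv")
       for n in panss["no_of_person"]]

rho, p_value = spearmanr(hrv, panss["PANSS_total"])
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```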
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data for Figure 9.15 from Chapter 9 of the Working Group I (WGI) Contribution to the Intergovernmental Panel on Climate Change (IPCC) Sixth Assessment Report (AR6).
Figure 9.15 shows Antarctic sea ice historical records and CMIP6 projections.
How to cite this dataset
When citing this dataset, please include both the data citation below (under 'Citable as') and the following citation for the report component from which the figure originates: Fox-Kemper, B., H.T. Hewitt, C. Xiao, G. Aðalgeirsdóttir, S.S. Drijfhout, T.L. Edwards, N.R. Golledge, M. Hemer, R.E. Kopp, G. Krinner, A. Mix, D. Notz, S. Nowicki, I.S. Nurhati, L. Ruiz, J.-B. Sallée, A.B.A. Slangen, and Y. Yu, 2021: Ocean, Cryosphere and Sea Level Change. In Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change [Masson-Delmotte, V., P. Zhai, A. Pirani, S.L. Connors, C. Péan, S. Berger, N. Caud, Y. Chen, L. Goldfarb, M.I. Gomis, M. Huang, K. Leitzell, E. Lonnoy, J.B.R. Matthews, T.K. Maycock, T. Waterfield, O. Yelekçi, R. Yu, and B. Zhou (eds.)]. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA, pp. 1211–1362, doi:10.1017/9781009157896.011.
Figure subpanels
The figure has 2 subpanels, with data provided for both panels.
List of data provided
This dataset contains:
- First column: Mean sea ice coverage during the decade 1979–1988.
- Second column: Mean sea ice coverage during the decade 2010–2019.
- Third column: Absolute change in sea ice concentration between these two decades, with grid lines indicating non-significant differences.
- Fourth column: Number of available CMIP6 models that simulate a mean sea ice concentration above 15% for the decade 2045–2054.
The average observational record of sea ice area is derived from the UHH sea ice area product (Doerr et al., 2021), based on the average sea ice concentration of OSISAF/CCI (OSI-450 for 1979–2015, OSI-430b for 2016–2019) (Lavergne et al., 2019), NASA Team (version 1, 1979–2019) (Cavalieri et al., 1996) and Bootstrap (version 3, 1979–2019) (Comiso, 2017) that is also used for the figure panels showing observed sea ice concentration.
Further details on data sources and processing are available in the chapter data table (Table 9.SM.9).
Data provided in relation to figure
Data provided in relation to Figure 9.15
Datafile 'mapplot_data.npz' included in the 'Plotted Data' folder of the GitHub repository is not archived here but on Zenodo at the link provided in the Related Documents section of this catalogue record.
CMIP6 is the sixth phase of the Coupled Model Intercomparison Project. NSIDC is the National Snow and Ice Data Center. UHH is the University of Hamburg (Universität Hamburg).
Notes on reproducing the figure from the provided data
Both panels were plotted using the standard matplotlib library; the code is available via the link in the documentation.
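As a rough illustration of how the plotted data might be inspected, the sketch below loads 'mapplot_data.npz' with NumPy and draws a single field with matplotlib. The array names stored in the file are not documented here, so the key used below is purely hypothetical; list the actual keys first and adapt accordingly.

```python
import numpy as np
import matplotlib.pyplot as plt

# Load the archived data file (path is illustrative).
data = np.load("mapplot_data.npz")
print(list(data.keys()))  # inspect the actual array names first

# 'sea_ice_concentration' is a hypothetical key; replace it with a real one.
field = data["sea_ice_concentration"]

fig, ax = plt.subplots()
im = ax.pcolormesh(field, cmap="Blues_r")
fig.colorbar(im, ax=ax, label="Sea ice concentration (%)")
ax.set_title("Figure 9.15 input field (illustrative)")
plt.show()
```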
Sources of additional information
The following weblinks are provided in the Related Documents section of this catalogue record:
- Link to the figure on the IPCC AR6 website
- Link to the report component containing the figure (Chapter 9)
- Link to the Supplementary Material for Chapter 9, which contains details on the input data used in Table 9.SM.9
- Link to the data and code used to produce this figure and others in Chapter 9, archived on Zenodo
- Link to the code and output data for this figure, contained in a dedicated GitHub repository