CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The NLI-TR dataset, comprising two distinct datasets known as SNLI-TR and MNLI-TR, provides an unparalleled opportunity for research within the natural language processing (NLP) and machine learning communities. Its primary purpose is to facilitate natural language inference research in the Turkish language. The datasets consist of meticulously curated natural language inference data, which has been carefully translated into Turkish from original English sources. This resource enables researchers to develop automated models specifically tailored for making inferences on texts in this vibrant language. Furthermore, it offers valuable insights into cross-lingual generalisation capabilities, allowing investigation into how models trained on data from one language perform when applied to another. It supports tasks ranging from sentence paraphrasing and classification to question answering scenarios, featuring Turkish sentences labelled to indicate whether a premise and hypothesis entail, contradict, or are neutral towards each other.
The dataset records typically include the following columns:
The data is typically provided in CSV file format. It includes both training and validation sets to support model development and evaluation. Key files mentioned are SNLI_tr_train.csv for training models, snli_tr_validation for testing or validating model accuracy on unseen data, and multinli_tr_validation_{matched / mismatched}.csv for additional validation on complex scenarios. The multinli_tr_train.csv file contains Turkish sentences with their corresponding labels. The dataset is considered large-scale, with the multinli_tr_train.csv file, for instance, containing approximately 392,700 records.
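As a minimal illustration, the training split can be loaded and inspected with pandas; the column layout is not specified in the description, so nothing beyond the file name is assumed here:

import pandas as pd

# Load the Turkish SNLI training split and inspect its structure.
snli_tr_train = pd.read_csv('SNLI_tr_train.csv')
print(snli_tr_train.shape)
print(snli_tr_train.head())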
This dataset is ideal for various applications and use cases in NLP and machine learning:
The dataset's scope is primarily the Turkish language. Because the data has been translated from English sources, it is also useful for cross-lingual studies. A specific time range or demographic scope for the data collection is not detailed in the available sources.
CC0
The NLI-TR dataset is intended for a broad audience interested in natural language processing and machine learning, including:
Original Data Source: NLI-TR (Turkish NLI Research)
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description: Dive into the world of college placements with this dataset designed to unravel the factors influencing student placement outcomes. The dataset comprises crucial parameters such as IQ scores, CGPA (Cumulative Grade Point Average), and placement status. Aspiring data scientists, researchers, and enthusiasts can leverage this dataset to uncover patterns and insights that contribute to a deeper understanding of successful college placements.
Project Idea 1: Predictive Modeling for College Placements Utilize machine learning algorithms to build a predictive model that forecasts a student's likelihood of placement based on their IQ scores and CGPA. Evaluate and compare the effectiveness of different algorithms to enhance prediction accuracy.
Project Idea 2: Feature Importance Analysis Conduct a feature importance analysis to identify the key factors that significantly influence placement outcomes. Gain insights into whether IQ, CGPA, or a combination of both plays a more dominant role in determining success.
Project Idea 3: Clustering Analysis of Placement Trends Apply clustering techniques to group students based on their placement outcomes. Explore whether distinct clusters emerge, shedding light on common characteristics or trends among students who secure placements.
Project Idea 4: Correlation Analysis with External Factors Investigate the correlation between the provided data (IQ, CGPA, placement) and external factors such as internship experience, extracurricular activities, or industry demand. Assess how these external factors may complement or influence placement success.
Project Idea 5: Visualization of Placement Dynamics Over Time Create dynamic visualizations to illustrate how placement trends evolve over time. Analyze trends, patterns, and fluctuations in placement rates to identify potential cyclical or seasonal influences on student placements.
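As a minimal sketch of Project Idea 1, the snippet below fits a logistic regression on the two predictors. The file name is illustrative, and the column names follow the column descriptions given below.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative file name; columns follow the dataset description (IQ, CGPA, Placement).
df = pd.read_csv('college_placements.csv')
X = df[['IQ', 'CGPA']]
y = df['Placement']

# Hold out 20% of the records to estimate out-of-sample accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
print('Test accuracy:', accuracy_score(y_test, model.predict(X_test)))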
IQ: The student's measured IQ score.
CGPA: The student's Cumulative Grade Point Average.
Placement: The placement outcome, indicating whether the student secured a placement.
These columns collectively provide a comprehensive snapshot of a student's intellectual abilities, academic performance, and their success in securing a placement. Analyzing this dataset can offer valuable insights into the dynamics of college placements and inform strategies for optimizing student outcomes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises 153 subfolders within a primary directory named data, derived from 85 participants. Each participant typically contributes 2–3 subfolders, contingent on the completeness and quality of their M-mode echocardiography (UCG) recordings. Subfolder names follow the format hdata + SubjectID + EJ/XJ/ZJ to denote the specific cardiac region captured in the ultrasound data: EJ denotes M-mode imaging of the mitral valve, XJ denotes M-mode imaging of the left ventricle, and ZJ denotes M-mode imaging of the aortic valve. For instance, a participant with identifier “001” may have subfolders named hdata1EJ, hdata1XJ, and/or hdata1ZJ, corresponding to each available M-mode echocardiographic segment. Each subfolder contains five distinct files, described in detail below.

1. BCG J-peak file
(1) File name: hdata+subjectID+EJ/XJ/ZJ_BCG.csv
(2) Content: J-peak positions in the BCG signal, presented in two columns:
(3) The first column provides the raw data point index.
(4) The second column specifies the corresponding time (in seconds) for each J-peak.

2. ECG R-peak file
(1) File name: hdata+subjectID+EJ/XJ/ZJ_ECG.csv
(2) Content: R-peak positions in the ECG signal, also in two columns:
(3) The first column provides the raw data point index.
(4) The second column specifies the corresponding time (in seconds) for each R-peak.

3. Ultrasound video
(1) File name: hdata+subjectID+EJ/XJ/ZJ_UCG.AVI
(2) Content: An AVI-format video of the simultaneously acquired M-mode echocardiogram. The suffix EJ, XJ, or ZJ indicates whether the imaging targeted the mitral valve, left ventricle, or aortic valve, respectively.

4. Signal data
(1) File name: signal.csv
(2) Content: Three columns of time-series data sampled at 100 Hz: raw BCG signal (Column 1); ECG data (Lead V2 or another designated lead) (Column 2); denoised BCG signal (Column 3), derived using the Enhanced Singular Value Thresholding (ESVT) algorithm.

5. Signal visualization
(1) File name: signal.pdf
(2) Content: A graphical representation of the signals from signal.csv. This file facilitates quick inspection of waveform alignment and overall signal quality.

In addition to the data directory, an Additional_info folder provides participant demographic and clinical details. Each row in subject_info.csv corresponds to an individual participant, listing their ID, sex, weight, height, age, heart rate, and ejection fraction (EF) (%). These parameters establish an informative link between each participant’s anthropometric profile, cardiac function metrics, and the corresponding BCG, ECG, and ultrasound data.
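A minimal sketch of loading one subfolder's files with pandas is shown below; the subfolder name is illustrative, and the assumption that the CSV files carry no header row should be checked against the actual data.

import pandas as pd

# Illustrative paths following the naming scheme described above.
signal = pd.read_csv('data/hdata1EJ/signal.csv', header=None,
                     names=['bcg_raw', 'ecg', 'bcg_denoised'])
j_peaks = pd.read_csv('data/hdata1EJ/hdata1EJ_BCG.csv', header=None,
                      names=['sample_index', 'time_s'])

# The signals are sampled at 100 Hz, so the recording length in seconds is:
duration_s = len(signal) / 100.0
print(f'Recording length: {duration_s:.1f} s, {len(j_peaks)} J-peaks detected')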
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge, however they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.
This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousands of loads and several hundreds of generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.
The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.
The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.
The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
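For example, converting a table of per-unit values back to MW only requires multiplying by the 100 MW base; a minimal pandas sketch, using a file name that follows the labelling scheme described below:

import pandas as pd

# Load one year of load profiles; all values are in per-unit (base 100 MW).
loads_pu = pd.read_csv('loads_2020_1.csv')

# Convert to MW by multiplying by the base power.
loads_mw = loads_pu * 100.0
print(loads_mw.iloc[:5, :3])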
There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using loads, generators, and lines profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):
import pandas as pd
CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)
The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:
CH_gens_list = CH_gens.dropna().squeeze().to_list()
Finally, we can import all the time series of Swiss generators from a given data table with
pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)
The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.
This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:
hourly_loads = pd.read_csv('loads_2018_3.csv')
To get a daily average of the loads, we can use:
daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()
This results in series of length 364. To average further over entire weeks and get series of length 52, we use:
weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation in the form of Jupyter notebooks contains numerous examples on how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.
This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘San Francisco Citywide Performance Metrics’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/1bcd26b4-c457-4ce5-985c-80a797830e7b on 11 February 2022.
--- Dataset description provided by original source is as follows ---
A. SUMMARY This data set reports key performance metrics for departments and programs in the City and County of San Francisco.
B. HOW THE DATASET IS CREATED City departments report updates about their key metrics to the Controller’s Office. The Controller's Office uses an online application to collect and organize this data. Departments update most metrics once or twice each year. Some metrics may not display data for every year.
C. UPDATE PROCESS Most metrics update twice each year. Updates with results for the first 6 months of each fiscal year are published in the spring, typically between April and May. Updates with results for each full fiscal year are published in the fall, typically in November.
D. HOW TO USE THIS DATASET Each row represents one metric and one fiscal year for a department, with multiple values for each fiscal year. Some metrics do not include values for all fields or fiscal years. Some results for the latest fiscal year are unavailable because of known lags in reporting. Users should review any data notes reported for each row for guidance about interpreting values. All values are reported as numbers without formatting, but the column [Measure Data Type] describes the intended format. For example, a value appearing as “0.50” with [Measure Data Type] reported as “Percent” should be displayed as “50%”.
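A minimal sketch of applying that formatting rule in Python; the file name and the 'Value' column name are assumptions, while [Measure Data Type] is taken from the description above.

import pandas as pd

# Illustrative file name; 'Value' is an assumed column name for the reported number.
df = pd.read_csv('citywide_performance_metrics.csv')

def format_value(row):
    # Values are stored unformatted; render percentages such as 0.50 as "50%".
    if row['Measure Data Type'] == 'Percent':
        return f"{float(row['Value']) * 100:g}%"
    return str(row['Value'])

df['Formatted Value'] = df.apply(format_value, axis=1)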
--- Original source retains full ownership of the source dataset ---
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is a substantial collection of over 241,000 English-language comments, gathered from various online platforms. Each comment within the dataset has been carefully annotated with a sentiment label: 0 for negative sentiment, 1 for neutral, and 2 for positive. The primary aim of this dataset is to facilitate the training and evaluation of multi-class sentiment analysis models, designed to work effectively with real-world text data. The dataset has undergone a preprocessing stage, ensuring comments are in lowercase, and are cleaned of punctuation, URLs, numbers, and stopwords, making it readily usable for Natural Language Processing (NLP) pipelines.
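A minimal sketch of the kind of preprocessing described above; the stopword list here is a small illustrative subset, whereas a full list (e.g. from NLTK or spaCy) would normally be used:

import re
import string

STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'is', 'are', 'to', 'of', 'in', 'it'}

def preprocess(comment: str) -> str:
    # Lowercase, then strip URLs, punctuation, numbers and stopwords.
    text = comment.lower()
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', ' ', text)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return ' '.join(tokens)

print(preprocess('Check https://example.com, it is 100% AMAZING!!!'))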
The dataset comprises over 241,000 records. While the specific file format is not detailed, such datasets are typically provided in a tabular format, often as a CSV file. It is structured with two distinct columns (the comment text and its sentiment label), suitable for direct integration into machine learning workflows.
This dataset is ideally suited for a variety of applications and use cases, including:
* Training sentiment classifiers utilising advanced models such as LSTM, BiLSTM, CNN, BERT, or RoBERTa.
* Evaluating the efficacy of different preprocessing and tokenisation strategies for text data.
* Benchmarking NLP models on multi-class classification tasks to assess their performance.
* Supporting educational projects and research initiatives in the fields of opinion mining or text classification.
* Fine-tuning transformer models on a large and diverse collection of sentiment-annotated text.
The dataset's coverage is global, comprising English-language comments. It focuses on general user-generated text content without specific demographic notes. The dataset is listed with a version of 1.0.
CC0
This dataset is suitable for individuals and organisations involved in data science and analytics. Intended users include:
* Data Scientists and Machine Learning Engineers for developing and deploying sentiment analysis models.
* Researchers and Academics for studies in NLP, text classification, and opinion mining.
* Students undertaking educational projects in artificial intelligence and machine learning.
Original Data Source: Sentiment Analysis Dataset
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains responses to a yearly panel survey among entrepreneurs in Baden-Württemberg. Based on the DEEM. research project's collected data (see the DEEM. project's website for more information), we survey founders to track the development of startups in our region and to assess the quality and performance of the local Entrepreneurial Ecosystem (see Empirical entrepreneurial ecosystem research: A guide to creating multilevel datasets for more information on this multilevel dataset). Surveys are sent out to all founders of currently active startups. Surveys were made available in German and English, with respondents being able to choose their preferred language at the start of the survey. For any questions about this survey or the underlying research project, please contact us.

Aims

Research
Integrating data on founders, firms, regional contexts and socioeconomic indicators, this data enables deeper insights into patterns and dynamics across different levels of Entrepreneurial Ecosystems (EEs) - insights often missed in traditional single-source and cross-sectional data studies. As such, this data contributes to the understanding of EEs as multilevel phenomena crucial for understanding and promoting productive entrepreneurship and economic development.

Respondents
We aim for a full population survey every year, instead of drawing samples. This means that all startups with an identifiable means of contact are contacted, with every potential respondent receiving a personalized survey link. Response rates typically vary between 10-15%. To increase response rates, the following approach is used: the survey is left open for a period of two months for founders to answer at their own pace, with periodic reminders sent. While the survey is designed as a panel to track founders' perceptions over time, we cannot guarantee that founders participate in more than one wave. As such, this dataset can be more accurately viewed as a "macro-panel" on the Entrepreneurial Ecosystem of BW.

Usage
This repository is structured as follows: the global codebook contains information on the broad concepts addressed in each survey wave, as well as the question batteries asked to address these concepts. As such it serves as a broad overview for researchers, to understand whether the data suits their research interests, and whether the relevant questions were asked in multiple years (i.e. panel analyses are possible), or whether they were included as one-off batteries. It is only available in English. The folders include the responses obtained for each survey year, as well as a wave-specific codebook with more detailed information. In contrast to the global codebook, these codebooks contain the questions and response options in both English and German, as well as meta-information about question filters and sub-groups if applicable. Additionally, for each item, basic summary statistics (number of responses per category, number of non-responses) are reported.

Data for each survey wave is made available in .csv format (Comma-Separated Values) with a header row. The columns are separated via semicolons (";"). This has been done to avoid conflicts, as some text responses and system variables included commas. Please consider this when loading and using the data with the analysis software of your choice. Should any issues arise in downloading, opening or using this data, please contact us for help.
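Because the files are semicolon-separated, the delimiter has to be set explicitly when loading them; a minimal pandas sketch (the file name is illustrative):

import pandas as pd

# Responses are provided as semicolon-separated CSV files with a header row.
wave = pd.read_csv('responses_wave_2023.csv', sep=';')
print(wave.shape)
print(list(wave.columns)[:10])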
This is a CloudSat-collocated subset of the original product OMSO2, for the purposes of the A-Train mission. The goal of the subset is to select and return OMI data that are within +/-100 km across the CloudSat track. The resultant OMI subset swath is sought to be about 200 km cross-track of CloudSat. Even though collocated with CloudSat, this subset can serve many other A-Train applications. (The shortname for this CloudSat-collocated subset of the original OMSO2 product is OMSO2_CPR_V003.)

This document describes the original OMI SO2 product (OMSO2) produced from global mode UV measurements of the Ozone Monitoring Instrument (OMI). OMI was launched on July 15, 2004 on the EOS Aura satellite, which is in a sun-synchronous ascending polar orbit with 1:45pm local equator crossing time. The data collection started on August 17, 2004 (orbit 482) and continues to this day with only minor data gaps. The minimum SO2 mass detectable by OMI is about two orders of magnitude smaller than the detection threshold of the legacy Total Ozone Mapping Spectrometer (TOMS) SO2 data (1978-2005) [Krueger et al 1995]. This is due to the smaller OMI footprint and the use of wavelengths better optimized for separating O3 from SO2.

The product file, called a data granule, covers the sunlit portion of the orbit with an approximately 2600 km wide swath containing 60 pixels per viewing line. During normal operations, 14 or 15 granules are produced daily, providing fully contiguous coverage of the globe. Currently, OMSO2 products are not produced when OMI goes into the "zoom mode" for one day every 452 orbits (~32 days).

For each OMI pixel we provide 4 different estimates of the column density of SO2 in Dobson Units (1 DU = 2.69x10^16 molecules/cm2), obtained by making different assumptions about the vertical distribution of the SO2. However, it is important to note that in most cases the precise vertical distribution of SO2 is unimportant. The users can use either the SO2 plume height, or the center of mass altitude (CMA) derived from the SO2 vertical distribution, to interpolate between the 4 values:
1) Planetary Boundary Layer (PBL) SO2 column (ColumnAmountSO2_PBL), corresponding to a CMA of 0.9 km.
2) Lower tropospheric SO2 column (ColumnAmountSO2_TRL), corresponding to a CMA of 2.5 km.
3) Middle tropospheric SO2 column (ColumnAmountSO2_TRM), usually produced by volcanic degassing, corresponding to a CMA of 7.5 km.
4) Upper tropospheric and stratospheric SO2 column (ColumnAmountSO2_STL), usually produced by explosive volcanic eruption, corresponding to a CMA of 17 km.

The accuracy and precision of the derived SO2 columns vary significantly with the SO2 CMA and column amount, observational geometry, and slant column ozone. OMI becomes more sensitive to SO2 above clouds and snow/ice, and less sensitive to SO2 below clouds. Preliminary error estimates are discussed below (see Data Quality Assessment).

OMSO2 files are stored in EOS Hierarchical Data Format (HDF-EOS5). Each file contains data from the daylit portion of an orbit (53 minutes). There are approximately 14 orbits per day. The maximum file size for the OMSO2 data product is about 9 Mbytes.
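A minimal sketch of inspecting one granule with h5py, without assuming the exact internal group layout (the file name is illustrative):

import h5py

# Walk the HDF-EOS5 file and report the SO2 column datasets described above.
with h5py.File('OMI-Aura_L2-OMSO2_sample.he5', 'r') as f:
    def report(name, obj):
        if isinstance(obj, h5py.Dataset) and 'ColumnAmountSO2' in name:
            print(name, obj.shape)
    f.visititems(report)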
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The published data were the basis for the analysis and publication of an article on human-AI collaboration in the workplace. A human-AI collaboration experiment was conducted, in which participants were asked to perform several tasks typical of knowledge-based work. Participants were divided into two groups: cooperating with generative AI and working independently.

There are two files published in this dataset, containing data for two separate parts of the study:
1. Collaboration experiment data
2. Textual data

Ad. 1 This dataset includes the results of pre-test (columns T-AM) and post-test surveys (columns AN-AV), participants' responses to tasks (columns AW-AZ), and the assessment of these tasks by independent judges (columns D-S). There are 94 observations and 51 variables.

Ad. 2 This dataset includes logs of participants' conversations with a bot from the group cooperating with AI. Column A is the message sent by user or bot, column B specifies who sent the message, and column C is the participant's anonymous username.
https://creativecommons.org/publicdomain/zero/1.0/
By ai2_arc (From Huggingface) [source]
The ai2_arc dataset, also known as the A Challenge Dataset for Advanced Question-Answering in Grade-School Level Science, is a comprehensive and valuable resource created to facilitate research in advanced question-answering. This dataset consists of a collection of 7,787 genuine grade-school level science questions presented in multiple-choice format.
The primary objective behind assembling this dataset was to provide researchers with a powerful tool to explore and develop question-answering models capable of tackling complex scientific inquiries typically encountered at a grade-school level. The questions within this dataset are carefully crafted to test the knowledge and understanding of various scientific concepts in an engaging manner.
The ai2_arc dataset is further divided into two primary sets: the Challenge Set and the Easy Set. Each set contains numerous highly curated science questions that cover a wide range of topics commonly taught at a grade-school level. These questions are designed specifically for advanced question-answering research purposes, offering an opportunity for model evaluation, comparison, and improvement.
In terms of data structure, the ai2_arc dataset features several columns providing vital information about each question. These include columns such as question, which contains the text of the actual question being asked; choices, which presents the multiple-choice options available for each question; and answerKey, which indicates the correct answer corresponding to each specific question.
Researchers can utilize this comprehensive dataset not only for developing advanced algorithms but also for training machine learning models that exhibit sophisticated cognitive capabilities when it comes to comprehending scientific queries from a grade-school perspective. Moreover, by leveraging these meticulously curated questions, researchers can analyze performance metrics such as accuracy or examine biases within their models' decision-making processes.
In conclusion, the ai2_arc dataset serves as an invaluable resource for anyone involved in advanced question-answering research within grade-school level science education. With its extensive collection of genuine multiple-choice science questions spanning various difficulty levels, researchers can delve into the intricate nuances of scientific knowledge acquisition, processing, and reasoning, ultimately unlocking novel insights and innovations in the field.
- Developing advanced question-answering models: The ai2_arc dataset provides a valuable resource for training and evaluating advanced question-answering models. Researchers can use this dataset to develop and test algorithms that can accurately answer grade-school level science questions.
- Evaluating natural language processing (NLP) models: NLP models that aim to understand and generate human-like responses can be evaluated using this dataset. The multiple-choice format of the questions allows for objective evaluation of the model's ability to comprehend and provide correct answers.
- Assessing human-level performance: The dataset can be used as a benchmark to measure the performance of human participants in answering grade-school level science questions. By comparing the accuracy of humans with that of AI systems, researchers can gain insights into the strengths and weaknesses of both approaches
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: ARC-Challenge_test.csv

| Column name | Description |
|:------------|:------------|
| question | The text content of each question being asked. (Text) |
| choices | A list of multiple-choice options associated with each question. (List of Text) |
| answerKey | The correct answer option (choice) for a particular question. (Text) |
File: ARC-Easy_test.csv | Column name | Description ...
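A minimal sketch of loading the Challenge test split with pandas; how the choices column is serialised in the CSV is an assumption (a Python/JSON-style literal is assumed here):

import ast
import pandas as pd

arc = pd.read_csv('ARC-Challenge_test.csv')
# Parse the serialised multiple-choice options back into Python objects.
arc['choices'] = arc['choices'].apply(ast.literal_eval)

row = arc.iloc[0]
print(row['question'])
print(row['choices'])
print('Correct answer:', row['answerKey'])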
https://object-store.os-api.cci2.ecmwf.int:443/bopen-cds2-stable-catalogue/licences/ghg-cci/ghg-cci_0911d58e24365e15589377902e562c6e9231290f75b14ddc3c7cb5fd09a265af.pdf
This dataset provides observations of atmospheric carbon dioxide (CO₂) amounts obtained from observations collected by several current and historical satellite instruments. Carbon dioxide is a naturally occurring Greenhouse Gas (GHG), but one whose abundance has been increased substantially above its pre-industrial value of some 280 ppm by human activities, primarily because of emissions from combustion of fossil fuels, deforestation and other land-use change. The annual cycle (especially in the northern hemisphere) is primarily due to seasonal uptake and release of atmospheric CO2 by terrestrial vegetation. Atmospheric carbon dioxide abundance is indirectly observed by various satellite instruments. These instruments measure spectrally resolved near-infrared and/or infrared radiation reflected or emitted by the Earth and its atmosphere. In the measured signal, molecular absorption signatures from carbon dioxide and other constituent gasses can be identified. It is through analysis of those absorption lines in these radiance observations that the averaged carbon dioxide abundance in the sampled atmospheric column can be determined. The software used to analyse the absorption lines and determine the carbon dioxide concentration in the sampled atmospheric column is referred to as the retrieval algorithm. For this dataset, carbon dioxide abundances have been determined by applying several algorithms to different satellite instruments. Typically, different algorithms have different strengths and weaknesses and therefore, which product to use for a given application typically depends on the application. The data set consists of 2 types of products:
• column-averaged mixing ratios of CO2, denoted XCO2
• mid-tropospheric CO2 columns
The XCO2 products have been retrieved from SCIAMACHY/ENVISAT, TANSO-FTS/GOSAT, TANSO-FTS2/GOSAT2 and OCO-2. The mid-tropospheric CO2 product has been retrieved from the IASI instruments on-board the Metop satellite series and from AIRS. The XCO2 products are available as Level 2 (L2) products (satellite orbit tracks) and as Level 3 (L3) product (gridded). The L2 products are available as individual sensor products (SCIAMACHY: BESD and WFMD algorithms; GOSAT: OCFP and SRFP algorithms) and as a multi-sensor merged product (EMMA algorithm). The L3 XCO2 product is provided in OBS4MIPS format. The IASI and AIRS products are available as L2 products generated with the NLIS algorithm. This data set is updated on a yearly basis, with each update cycle adding (if required) a new data version for the entire period, up to one year behind real time. This dataset is produced on behalf of C3S with the exception of the SCIAMACHY and AIRS L2 products that were generated in the framework of the GHG-CCI project of the European Space Agency (ESA) Climate Change Initiative (CCI).
AASG Wells Data for the EGS Test Site Planning and Analysis Task
Temperature measurement data obtained from boreholes for the Association of American State Geologists (AASG) geothermal data project. Typically bottomhole temperatures are recorded from log headers, and this information is provided through a borehole temperature observation service for each state. Service includes header records, well logs, temperature measurements, and other information for each borehole. Information presented in Geothermal Prospector was derived from data aggregated from the borehole temperature observations for all states. For each observation, the given well location was recorded and the best available well identifier (name), temperature and depth were chosen. The "Well Name Source," "Temp. Type" and "Depth Type" attributes indicate the field used from the original service. This data was then cleaned and converted to consistent units. The accuracy of the observation's location, name, temperature or depth was not assessed beyond that originally provided by the service.
• Temperature:
  • CorrectedTemperature – best
  • MeasuredTemperature – next best
• Depth:
  • DepthOfMeasurement – best
  • TrueVerticalDepth – next best
  • DrillerTotalDepth – last option
• Well Name/Identifier:
  • APINo – best
  • WellName – next best
  • ObservationURI – last option
The column headers are as follows:
• gid = internal unique ID
• src_state = the state from which the well was downloaded (note: the low temperature wells in Idaho are coded as “ID_LowTemp”, while all other wells are simply the two character state abbreviation)
• source_url = the url for the source WFS service or Excel file
• temp_c = “best” temperature in Celsius
• temp_type = indicates whether temp_c comes from the corrected or measured temperature header column in the source document
• depth_m = “best” depth in meters
• depth_type = indicates whether depth_m comes from the measured, true vertical, or driller total depth header column in the source document
• well_name = “best” well name or ID
• name_src = indicates whether well_name came from apino, wellname, or observationuri header column in the source document
• lat_wgs84 = latitude in wgs84
• lon_wgs84 = longitude in wgs84
• state = state in which the point is located
• county = county in which the point is located
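A minimal sketch of using these columns with pandas (the file name is illustrative):

import pandas as pd

wells = pd.read_csv('aasg_wells.csv')

# Keep records with both a usable temperature and depth, then pick out
# wells at least 1 km deep with temperatures of 90 °C or more.
usable = wells.dropna(subset=['temp_c', 'depth_m'])
deep_hot = usable[(usable['depth_m'] >= 1000) & (usable['temp_c'] >= 90)]
print(len(deep_hot), 'wells at >= 1 km depth with >= 90 °C')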
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset offers a powerful synthetic English-ASL gloss parallel corpus that was generated in 2012, providing an exciting opportunity to bridge the cultural divide between English and American Sign Language. By exploring this cross-cultural language interoperability, it aims to connect linguistic communities and bring together aspects of communication often seen as separated. The data supports innovative approaches to machine translation models and helps to uncover further insights into bridging linguistic divides.
The dataset consists of two primary columns:
The dataset is typically provided in a CSV file format, specifically referenced as train.csv. It comprises two columns: gloss and text. The gloss column contains 81,123 unique values, while the text column contains 81,016 unique values. This indicates the dataset consists of approximately 81,123 records.
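A minimal sketch of loading the file and checking the figures quoted above:

import pandas as pd

pairs = pd.read_csv('train.csv')
print(pairs[['gloss', 'text']].head())
print('Unique glosses:', pairs['gloss'].nunique())
print('Unique texts:', pairs['text'].nunique())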
This dataset can be used for a variety of applications and use cases, including:
The dataset focuses on the linguistic relationship between English and American Sign Language. While specific demographic details are not provided, its general availability is noted as global. The data was generated in 2012, offering a snapshot from that time.
CC0
This dataset is ideal for:
Original Data Source: AslgPc12 (English-ASL Gloss Parallel Corpus 2012)
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is a classic and very widely used dataset in machine learning and statistics, often serving as a first dataset for classification problems. Introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems," it is a foundational resource for learning classification algorithms.
Overview:
The dataset contains measurements for 150 samples of iris flowers. Each sample belongs to one of three species of iris: Iris setosa, Iris versicolor, and Iris virginica.
For each flower, four features were measured: sepal length, sepal width, petal length, and petal width, all in centimetres.
The goal is typically to build a model that can classify iris flowers into their correct species based on these four features.
File Structure:
The dataset is usually provided as a single CSV (Comma Separated Values) file, often named iris.csv or similar. This file typically contains one column for each of the four measurements listed above plus a column for the species label.
Content of the Data:
The dataset contains an equal number of samples (50) for each of the three iris species. The measurements of the sepal and petal dimensions vary between the species, allowing for their differentiation using machine learning models.
How to Use This Dataset:
Load the iris.csv file into your analysis environment of choice and train a classifier on the four measurement columns, as sketched below.
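The column names used in this sketch (the four measurements plus a species label) are assumptions, since they vary between distributions of the file:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = pd.read_csv('iris.csv')
X = iris.drop(columns=['species'])   # the four measurement columns
y = iris['species']                  # the class label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print('Test accuracy:', clf.score(X_test, y_test))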
Citation:
When using the Iris dataset, it is common to cite Ronald Fisher's original work:
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188.
Data Contribution:
Thank you for providing this classic and fundamental dataset to the Kaggle community. The Iris dataset remains an invaluable resource for both beginners learning the basics of classification and experienced practitioners testing new algorithms. Its simplicity and clear class separation make it an ideal starting point for many data science projects.
If you find this dataset description helpful and the dataset itself useful for your learning or projects, please consider giving it an upvote after downloading. Your appreciation is valuable!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset offers a detailed look at the biomechanics of boxing punches, specifically the jab and rear cross, by capturing the acceleration of different parts of the arm and the force generated during these movements. The integration of IMU and force plate data provides a rich source of information for analyzing the efficiency, power, and technique of athletes, offering valuable insights for coaches, researchers, and athletes in the field of sports science and physical culture.
This dataset comprises a collection of Excel files containing measurements of right hand pads from various participants. Each file is named to reflect the participant's identifier followed by the specific measurement context (e.g., "participant1_right hand pad.xlsx"). The original dataset included participant surnames, but these have been replaced with unique numerical identifiers (e.g., "participant1", "participant2", etc.) to ensure anonymity. The dataset is intended for use in physical culture and sports science research, providing valuable data for studies on hand measurements and their implications in sports and physical activities.
Files within this dataset follow a standardized naming convention: <participant_id>_<measurement>.xlsx. The <participant_id> part is a unique numerical identifier assigned to each participant (e.g., "participant1"), and <measurement> describes the specific measurement focus of the file (e.g., "right hand pad").
Time: The timestamp or duration associated with each measurement session, indicating when each set of measurements was taken, typically essential for analyzing movement or force over time.
1x, 1y, 1z: Acceleration data (in milli-g) for the IMU placed on the fist. These columns capture the three-dimensional acceleration of the fist during boxing techniques, with 'x', 'y', and 'z' representing the acceleration along the horizontal, vertical, and depth axes, respectively. This data is crucial for understanding the speed and direction of the punch.
2x, 2y, 2z: Acceleration data (in milli-g) for the IMU on the forearm. Similar to the fist data, these columns provide insights into the forearm's movement dynamics during the execution of boxing techniques, offering a comprehensive view of the arm's acceleration.
3x, 3y, 3z: Acceleration data (in milli-g) for the IMU placed on the upper arm. These measurements complement the fist and forearm data, providing a complete picture of the arm's acceleration and movement patterns during different boxing punches.
fx, fy, fz: Force measurements (in Newtons) from the force plate. These columns represent the force exerted in the x (horizontal), y (vertical), and z (depth or forward/backward) directions. Force plate data is essential for analyzing the power and effectiveness of boxing techniques, as well as the athlete's balance and stability during the punch execution.
Files are related to code on GitHub: https://github.com/Dareczin/boxing_biomechanics
Based on this code, data is saved in two folders: the original data capture (5 strikes in one measurement) and files produced after processing, in which each event (strike) is extracted as a separate file.
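A minimal sketch of reading one participant's file and summarising the fist acceleration and plate force, assuming the column names listed above (reading .xlsx files with pandas requires the openpyxl package):

import numpy as np
import pandas as pd

df = pd.read_excel('participant1_right hand pad.xlsx')

# Resultant acceleration of the fist IMU, converted from milli-g to g.
fist_accel_g = np.sqrt(df['1x']**2 + df['1y']**2 + df['1z']**2) / 1000.0

# Resultant force measured by the force plate, in newtons.
force_n = np.sqrt(df['fx']**2 + df['fy']**2 + df['fz']**2)

print('Peak fist acceleration [g]:', fist_accel_g.max())
print('Peak force [N]:', force_n.max())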
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset is related to 146 typically developing infants who underwent baseline electrophysiological (EEG) data recording at 6 (T6) and 12 (T12) months of age. The recordings were made using a dense-array EGI system (Geodesic EEG System (GES) 300 or 400, Electrical Geodesics, Inc., Eugene, Oregon, USA) equipped with 60/64-electrode or 128-electrode caps (HydroCel Geodesic Sensor net).
Whole-brain functional connectivity (FC) metrics were extracted, for multiple frequency bands with the aim to evaluate brain maturation in the first year of life in terms of EEG functional connectivity. In addition, Bayley test was administered at 24 months of age to explore possible relation between brain network connectivity and cognitive functions.
The dataset includes subjects’ sociodemographic and individual factors (such as age, sex, socioeconomic status, gestational week and birth weight); functional connectivity metrics (such as the magnitude-squared coherence index, phase lag index (PLI), and parameters characterizing the minimum spanning tree built from the PLI index) computed in the delta (2-4 Hz), theta (4-6 Hz), low-alpha (6-9 Hz), high-alpha (9-13 Hz), beta (13-30 Hz) and gamma (30-45 Hz) frequency bands; and the Bayley test raw scores, assessed at 24 months of age.
In particular, each row in the database corresponds to a subject and each column to a different variable.
Column A, Subject code;
Column B, Time point: 1 = data related to the EEG recording performed at six months of age; 2 = data related to the EEG recording performed at twelve months of age;
Column C, Sex: 0= males; 1=females;
Column D, Age (expressed in days) at T6;
Column E, Age (expressed in days) at T12;
Column F, Family socio-economic status;
Column G, Gestational age expressed in weeks;
Column H, Birth weight expressed in grams;
Columns I and J, Bayley Cognitive Composite Score and Griffiths developmental quotient, both assessed at 6 months of age;
Column K and L, Number of electrodes of the used electrode-caps for T6 and T12, respectively;
Columns from M to BM, FC metrics for all frequency bands;
Columns from BO to BQ, raw cognitive, receptive and expressive Bayley test scores assessed at 24 months of age;
Column BR, Composite language metric derived from the expressive and receptive Bayley scores.
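A minimal sketch of selecting a subset of these columns with pandas; the file name, and the assumption that the database is distributed as a spreadsheet, are illustrative only:

import pandas as pd

# Select subject code, time point and the Bayley-related columns by their letters.
df = pd.read_excel('eeg_fc_dataset.xlsx', usecols='A:B,BO:BR')

# Time point 1 = recording at 6 months (T6), 2 = recording at 12 months (T12).
t6 = df[df.iloc[:, 1] == 1]
t12 = df[df.iloc[:, 1] == 2]
print(len(t6), 'rows at T6;', len(t12), 'rows at T12')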
If you use this dataset please cite the following manuscript: Falivene, A.; Cantiani, C.; Dondena, C.; Riboldi, E.M.; Riva, V.; Piazza, C. EEG Functional Connectivity Analysis for the Study of the Brain Maturation in the First Year of Life. Sensors 2024, 24, 4979. All details about subjects, data acquisition and signal processing pipeline are described in the manuscript.
ACTIVE APPRENTICESHIP PROGRAM PARTICIPANTS, APPRENTICESHIP SPONSORS & APPROVED TRAINING DELIVERY AGENTS (TDA) BY TRADE - PROVINCE OF ONTARIO
Demographic data for those enrolled in Ontario apprenticeship programs. The data covers 158 trades.
For each trade, the data includes:
• sector
• Red Seal (yes/no)
• total number of participants
• gender
• age cohort
• number of approved sponsors
• number of public training delivery agents
• number of private training delivery agents
A program participant is an individual who is active in an apprenticeship training program for a specific trade. To be included in the count, program participants must have a training agreement with a sponsor that is currently registered with the Ministry or was in a registered status within the last 12 months. Program participants currently without registered training agreements are usually between apprenticeship jobs or attending classroom training.
A sponsor is responsible for an apprentice’s on-the-job training. Sponsors are typically employers, unions or local apprenticeship committees. Because one sponsor may sponsor apprentices in multiple trades, this column cannot be summed to arrive at the total number of unique apprenticeship sponsors.
Training delivery agents (TDAs) are approved by the Ministry to deliver the classroom training component of apprenticeship programs. Because one TDA may deliver classroom training for multiple trades, this column cannot be summed to arrive at the total number of unique TDAs.
Explanation of Dataset Column Headings:
TRADE SECTOR: Trades are grouped into four sectors: Construction, Industrial, Motive Power, and Service.
REDSEAL = Red Seal certification in a trade means the holder can work in any province that participates in the Interprovincial Red Seal Program without further assessment or testing.
TOTAL PARTICIPANTS = Total number of apprenticeship program participants in the trade; where the number of total participants is less than 20, the actual number is not displayed to protect the privacy of individual program participants.
MALE/FEMALE: Where the number for one of the genders is less than 20, actual numbers are not displayed to protect the privacy of individual program participants. Instead, the information will display as either “< 20” or “> 20”.
Column headings for the age cohort for participants:
• AGE UNDER 20 = Under 20 years of age
• AGE 20-29 = Between 20 and 29
• AGE 30-44 = Between 30 and 44
• AGE 45-54 = Between 45 and 54
• AGE 55PLUS = Over 55 years of age
PUBLIC TDAS = Training delivery agents that are Colleges of Applied Arts and Technology; Institutes of Technology and Advanced Learning
PRIVATE TDAS = Training delivery agents funded privately, including union-run training centres, private career colleges and employer-run training centres. * Unique Counts cannot be derived by summing these columns. Unique totals are provided on the Grand Total Summary Line.
Notes:
1. When appropriate, counts of remaining age ranges in the same trade, where providing those values would allow the suppressed value to be calculated, were also suppressed for privacy reasons and indicated with a ">20".
2. As one employer may sponsor apprentices in multiple trades, this column cannot be summed to arrive at the total number of unique apprenticeship sponsors.
3. Training Delivery Agents (TDAs) are approved by the Ministry to deliver the in-class component of apprenticeship programs.
4. As one TDA may deliver classroom training for multiple trades, this column cannot be summed to arrive at the total number of unique TDAs.
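When loading the data programmatically, the suppressed counts ("< 20" / "> 20") need to be handled before any column is summed; a minimal pandas sketch (the file name is illustrative, and the exact column heading should be checked against the file):

import pandas as pd

df = pd.read_csv('apprenticeship_by_trade.csv', dtype=str)

def to_count(value):
    # Treat suppressed values such as "< 20" or "> 20" as missing.
    if isinstance(value, str) and value.strip() in ('< 20', '> 20', '<20', '>20'):
        return None
    return value

df['TOTAL PARTICIPANTS'] = pd.to_numeric(df['TOTAL PARTICIPANTS'].map(to_count),
                                         errors='coerce')
print(df['TOTAL PARTICIPANTS'].sum())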
The Sentinel-5P TROPOMI Near Real Time (NRT) Tropospheric Ozone Column V2 (S5P_L2_O3_TCL_NRT) at GES DISC is the near real time version of the offline S5P_L2_O3_TCL product. These data are typically available within three hours of measurement, as required by the Land Atmosphere NRT Capability Earth Observing System (LANCE). They are intended for rapid turnaround assessment and are only archived for up to ten days. Users who require a longer data record, or who wish to conduct rigorous analysis, should use the offline version of this product, S5P_L2_O3_TCL.

The Copernicus Sentinel-5 Precursor (Sentinel-5P or S5P) satellite mission is one of the European Space Agency's (ESA) new mission family - Sentinels, and it is a joint initiative between the Kingdom of the Netherlands and the ESA. The sole payload on Sentinel-5P is the TROPOspheric Monitoring Instrument (TROPOMI), a nadir-viewing, 108 degree field-of-view, push-broom grating hyperspectral spectrometer covering the ultraviolet-visible (UV-VIS, 270 nm to 495 nm), near infrared (NIR, 675 nm to 775 nm), and shortwave infrared (SWIR, 2305 nm to 2385 nm) wavelength ranges. Sentinel-5P is the first of the Atmospheric Composition Sentinels and is expected to provide measurements of ozone, NO2, SO2, CH4, CO, formaldehyde, aerosols and cloud at high spatial, temporal and spectral resolutions.

Copernicus Sentinel-5P tropospheric ozone data products are retrieved by the convective-cloud-differential (CCD) algorithm to derive the tropospheric ozone columns, and by the cloud slicing algorithm (CSA) to derive mean upper tropospheric ozone volume mixing ratios above the clouds. The S5P_TROPOZ_CCD algorithm uses TROPOMI Level-2 ozone column measurements and the cloud parameters provided by S5P_CLOUD_OCRA and S5P_CLOUD_ROCINN; from these, the average values of the tropospheric ozone columns below 270 hPa can be determined. The S5P_TROPOZ_CSA algorithm uses the correlation between cloud top pressure and the ozone column above the cloud. The retrieval depends on the number of measurements with high cloud cover. The products are restricted to the tropical region (-20 degrees to 20 degrees of latitude).

The main outputs of the Copernicus S5P/TROPOMI tropospheric ozone product include the tropospheric ozone column and corresponding errors, upper tropospheric ozone and corresponding errors, stratospheric ozone column and corresponding errors, and the retrieval quality flags. The data are stored in an enhanced netCDF-4 format, in individual files (granules) that each contain one orbit of information. Complete files are about 20 MB.
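A minimal sketch of listing the variables in one granule with the netCDF4 library, without assuming a particular internal group layout (the file name is illustrative):

from netCDF4 import Dataset

def walk(group, prefix=''):
    # Recursively print every variable in the netCDF-4 group tree.
    for name in group.variables:
        print(prefix + name)
    for child_name, child in group.groups.items():
        walk(child, prefix + child_name + '/')

with Dataset('S5P_NRTI_L2__O3_TCL_sample.nc') as nc:
    walk(nc)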
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT
The issue of diagnosing psychotic diseases, including schizophrenia and bipolar disorder, in particular, the objectification of symptom severity assessment, is still a problem requiring the attention of researchers. Two measures that can be helpful in patient diagnosis are heart rate variability calculated based on electrocardiographic signal and accelerometer mobility data. The following dataset contains data from 30 psychiatric ward patients having schizophrenia or bipolar disorder and 30 healthy persons. The duration of the measurements for individuals was usually between 1.5 and 2 hours. R-R intervals necessary for heart rate variability calculation were collected simultaneously with accelerometer data using a wearable Polar H10 device. The Positive and Negative Syndrome Scale (PANSS) test was performed for each patient participating in the experiment, and its results were attached to the dataset. Furthermore, the code for loading and preprocessing data, as well as for statistical analysis, was included on the corresponding GitHub repository.
BACKGROUND
Heart rate variability (HRV), calculated based on electrocardiographic (ECG) recordings of R-R intervals stemming from the heart's electrical activity, may be used as a biomarker of mental illnesses, including schizophrenia and bipolar disorder (BD) [Benjamin et al]. The variations of R-R interval values correspond to the heart's autonomic regulation changes [Berntson et al, Stogios et al]. Moreover, the HRV measure reflects the activity of the sympathetic and parasympathetic parts of the autonomous nervous system (ANS) [Task Force of the European Society of Cardiology the North American Society of Pacing Electrophysiology, Matusik et al]. Patients with psychotic mental disorders show a tendency for a change in the centrally regulated ANS balance in the direction of less dynamic changes in the ANS activity in response to different environmental conditions [Stogios et al]. Larger sympathetic activity relative to the parasympathetic one leads to lower HRV, while, on the other hand, higher parasympathetic activity translates to higher HRV. This loss of dynamic response may be an indicator of mental health. Additional benefits may come from measuring the daily activity of patients using accelerometry. This may be used to register periods of physical activity and inactivity or withdrawal for further correlation with HRV values recorded at the same time.
EXPERIMENTS
In our experiment, the participants were 30 psychiatric ward patients with schizophrenia or BD and 30 healthy people. All measurements were performed using a Polar H10 wearable device. The sensor collects ECG recordings and accelerometer data and, additionally, performs detection of R-wave peaks. Participants had to wear the sensor for a given time, usually between 1.5 and 2 hours; the shortest recording was 70 minutes. During this time, beginning a few minutes after the start of the measurement, participants could perform any activity. Participants were encouraged to undertake physical activity and, more specifically, to take a walk. Because the patients were in the medical ward, they were instructed to take a walk in the corridors at the beginning of the experiment and to repeat the walk 30 minutes and 1 hour after the first walk, with the subsequent walks slightly longer (about 3, 5 and 7 minutes, respectively). We did not repeat this instruction or supervise compliance during the experiment, in either the treatment or the control group. Seven persons from the control group did not receive this instruction; their measurements correspond to freely selected activities with rest periods, although at least three of them performed physical activities during this time. Nevertheless, at the start of the experiment, all participants were requested to rest in a sitting position for 5 minutes. Moreover, for each patient, the disease severity was assessed using the PANSS test, and its scores are attached to the dataset.
The data from the sensors were collected using the Polar Sensor Logger application [Happonen]. The extracted measurements were then preprocessed and analyzed using code prepared by the authors of the experiment, which is publicly available in the GitHub repository [Książek et al].
First, we performed manual artifact detection to remove abnormal heartbeats caused by non-sinus beats and technical issues with the device (e.g. temporary disconnections and inappropriate electrode readings). We also performed anomaly detection using a Daubechies wavelet transform. The dataset nevertheless includes the raw data, and the full code necessary to reproduce our anomaly detection approach is available in the repository. Optionally, cubic spline interpolation of the data can also be performed. After that step, rolling windows of a chosen size, with chosen time intervals between them, are created. A statistical analysis is then carried out, e.g. mean HRV calculation using the RMSSD (Root Mean Square of Successive Differences) approach, measurement of the relationship between mean HRV and PANSS scores, mobility coefficient calculation based on the accelerometer data, and verification of the dependencies between HRV and mobility scores.
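For illustration, the following minimal sketch shows how RMSSD can be computed over rolling time windows of R-R intervals. It is not the authors' implementation from the repository [Książek et al]; the window length, step and input format (a pandas Series of R-R intervals in milliseconds indexed by timestamps) are assumptions made for this example.

```python
import numpy as np
import pandas as pd

def rmssd(rr_ms: np.ndarray) -> float:
    """Root Mean Square of Successive Differences of R-R intervals (in ms)."""
    diffs = np.diff(rr_ms)
    return float(np.sqrt(np.mean(diffs ** 2)))

def rolling_rmssd(rr: pd.Series, window: str = "5min", step: str = "1min") -> pd.Series:
    """RMSSD computed over rolling time windows.

    rr: R-R intervals in milliseconds, indexed by a DatetimeIndex.
    window/step: window length and spacing between window starts (assumed values).
    """
    win, stp = pd.Timedelta(window), pd.Timedelta(step)
    start, end = rr.index.min(), rr.index.max()
    values = {}
    t = start
    while t + win <= end:
        chunk = rr.loc[t:t + win]          # all R-R intervals inside the window
        if len(chunk) > 1:
            values[t] = rmssd(chunk.to_numpy())
        t += stp
    return pd.Series(values, name="rmssd_ms")
```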
DATA DESCRIPTION
The structure of the dataset is as follows. One folder, called HRV_anonymized_data, contains the values of R-R intervals together with timestamps for each experiment participant. The data were properly anonymized, i.e. the day of the measurement was removed to prevent identification of individuals. Files concerning patients are named treatment_X.csv, where X is the identification number of the person, while files related to the healthy controls are named control_Y.csv, where Y is the identification number of the person. Furthermore, for visualization purposes, an image of the raw R-R intervals of each participant is provided, named raw_RR_{control,treatment}_N.png, where N is the number of the person from the control/treatment group. The collected data are raw, i.e. before anomaly removal. The code enabling reproduction of the anomaly detection stage and removal of suspicious heartbeats is publicly available in the repository [Książek et al]. The structure of the files containing R-R intervals is as follows:
Phone timestamp | RR-interval [ms] |
12:43:26.538000 | 651 |
12:43:27.189000 | 632 |
12:43:27.821000 | 618 |
12:43:28.439000 | 621 |
12:43:29.060000 | 661 |
... | ... |
The first column contains the timestamp at which the distance between two consecutive R peaks was registered. The corresponding R-R interval, expressed in milliseconds, is given in the second column of the file.
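For orientation, a minimal loading sketch is shown below. It assumes that the CSV files contain the two columns shown above ('Phone timestamp' and 'RR-interval [ms]') and that the export is semicolon-separated; both assumptions should be verified against the actual files and the repository code.

```python
import pandas as pd

def load_rr(path: str) -> pd.Series:
    """Load one HRV_anonymized_data file into a Series of R-R intervals (ms).

    Column names and separator are assumptions based on the table above;
    verify them against the actual CSV files.
    """
    df = pd.read_csv(path, sep=";")  # adjust sep if the files are comma-separated
    df.columns = [c.strip() for c in df.columns]
    ts = pd.to_datetime(df["Phone timestamp"], format="%H:%M:%S.%f")
    rr = pd.Series(df["RR-interval [ms]"].to_numpy(), index=ts, name="rr_ms")
    return rr.sort_index()

# Example (hypothetical path):
# rr = load_rr("HRV_anonymized_data/treatment_1.csv")
```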
The second folder, called accelerometer_anonymized_data, contains the accelerometer data collected at the same time as the R-R intervals. The naming convention is the same as for the R-R interval data: treatment_X.csv and control_Y.csv contain the data from persons in the treatment and control groups, respectively, where X and Y are the identification numbers of the participants. The numbers are exactly the same as for the R-R intervals. The structure of the files with accelerometer recordings is as follows:
Phone timestamp | X [mg] | Y [mg] | Z [mg] |
13:00:17.196000 | -961 | -23 | 182 |
13:00:17.205000 | -965 | -21 | 181 |
13:00:17.215000 | -966 | -22 | 187 |
13:00:17.225000 | -967 | -26 | 193 |
13:00:17.235000 | -965 | -27 | 191 |
... | ... | ... | ... |
The first column contains the timestamp, while the next three columns give the acceleration registered at that moment along the X, Y and Z axes, expressed in milli-g.
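The mobility coefficient used in the analysis is defined in the authors' repository. As a purely illustrative proxy, the sketch below summarizes movement as the standard deviation of the acceleration magnitude per time window; the column names, the separator and the window length are assumptions based on the table above.

```python
import numpy as np
import pandas as pd

def load_acc(path: str) -> pd.DataFrame:
    """Load one accelerometer_anonymized_data file (X, Y, Z in milli-g)."""
    df = pd.read_csv(path, sep=";")  # adjust sep if the files are comma-separated
    df.columns = [c.strip() for c in df.columns]
    df["timestamp"] = pd.to_datetime(df["Phone timestamp"], format="%H:%M:%S.%f")
    return df.set_index("timestamp")[["X [mg]", "Y [mg]", "Z [mg]"]]

def activity_proxy(acc: pd.DataFrame, window: str = "1min") -> pd.Series:
    """Standard deviation of the acceleration magnitude per time window.

    This is only an illustrative mobility proxy, not the mobility
    coefficient defined in the authors' repository.
    """
    magnitude = np.sqrt((acc ** 2).sum(axis=1))
    return magnitude.resample(window).std()
```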
We also attached a file with the PANSS test scores (PANSS.csv) for all patients participating in the measurement. The structure of this file is as follows:
no_of_person | PANSS_P | PANSS_N | PANSS_G | PANSS_total |
1 | 8 | 13 | 22 | 43 |
2 | 11 | 7 | 18 | 36 |
3 | 14 | 30 | 44 | 88 |
4 | 18 | 13 | 27 | 58 |
... | ... | ... | ... | ... |
The first column contains the identification number of the patient, the next three columns contain the PANSS scores for positive, negative and general symptoms, respectively, and the last column contains the total PANSS score.
USAGE NOTES
All the files necessary to run the HRV and/or accelerometer data analysis are available in the GitHub repository [Książek et al]. HRV data loading, preprocessing (i.e. anomaly detection and removal) and the subsequent statistical analysis can all be reproduced with the scripts provided there.
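As a rough illustration of how the PANSS scores can be combined with the HRV recordings, the sketch below computes a Spearman correlation between a per-patient overall RMSSD and the total PANSS score. It skips anomaly removal and windowing, assumes the file paths and separators described above, and is not the analysis implemented in the repository.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def mean_rmssd(path: str) -> float:
    """Overall RMSSD of one R-R interval file (columns as in the tables above)."""
    df = pd.read_csv(path, sep=";")  # adjust the separator if needed
    df.columns = [c.strip() for c in df.columns]
    rr = df["RR-interval [ms]"].to_numpy(dtype=float)
    return float(np.sqrt(np.mean(np.diff(rr) ** 2)))

# PANSS.csv columns: no_of_person, PANSS_P, PANSS_N, PANSS_G, PANSS_total
panss = pd.read_csv("PANSS.csv")

hrv = [mean_rmssd(f"HRV_anonymized_data/treatment_{int(n)}.csv")
       for n in panss["no_of_person"]]

rho, p_value = spearmanr(hrv, panss["PANSS_total"])
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```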
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data for Figure 9.15 from Chapter 9 of the Working Group I (WGI) Contribution to the Intergovernmental Panel on Climate Change (IPCC) Sixth Assessment Report (AR6).
Figure 9.15 shows Antarctic sea ice historical records and CMIP6 projections.
How to cite this dataset
When citing this dataset, please include both the data citation below (under 'Citable as') and the following citation for the report component from which the figure originates: Fox-Kemper, B., H.T. Hewitt, C. Xiao, G. Aðalgeirsdóttir, S.S. Drijfhout, T.L. Edwards, N.R. Golledge, M. Hemer, R.E. Kopp, G. Krinner, A. Mix, D. Notz, S. Nowicki, I.S. Nurhati, L. Ruiz, J.-B. Sallée, A.B.A. Slangen, and Y. Yu, 2021: Ocean, Cryosphere and Sea Level Change. In Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change [Masson-Delmotte, V., P. Zhai, A. Pirani, S.L. Connors, C. Péan, S. Berger, N. Caud, Y. Chen, L. Goldfarb, M.I. Gomis, M. Huang, K. Leitzell, E. Lonnoy, J.B.R. Matthews, T.K. Maycock, T. Waterfield, O. Yelekçi, R. Yu, and B. Zhou (eds.)]. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA, pp. 1211–1362, doi:10.1017/9781009157896.011.
Figure subpanels
The figure has 2 subpanels, with data provided for both panels.
List of data provided
This dataset contains:
- First column: Mean sea ice coverage during the decade 1979–1988.
- Second column: Mean sea ice coverage during the decade 2010–2019.
- Third column: Absolute change in sea ice concentration between these two decades, with grid lines indicating non-significant differences.
- Fourth column: Number of available CMIP6 models that simulate a mean sea ice concentration above 15% for the decade 2045–2054.
The average observational record of sea ice area is derived from the UHH sea ice area product (Doerr et al., 2021), based on the average sea ice concentration of OSISAF/CCI (OSI-450 for 1979–2015, OSI-430b for 2016–2019) (Lavergne et al., 2019), NASA Team (version 1, 1979–2019) (Cavalieri et al., 1996) and Bootstrap (version 3, 1979–2019) (Comiso, 2017) that is also used for the figure panels showing observed sea ice concentration.
Further details on data sources and processing are available in the chapter data table (Table 9.SM.9).
Data provided in relation to figure
Data provided in relation to Figure 9.15
Datafile 'mapplot_data.npz' included in the 'Plotted Data' folder of the GitHub repository is not archived here but on Zenodo at the link provided in the Related Documents section of this catalogue record.
CMIP6 is the sixth phase of the Coupled Model Intercomparison Project. NSIDC is the National Snow and Ice Data Center. UHH is the University of Hamburg (Universität Hamburg).
Notes on reproducing the figure from the provided data
Both panels were plotted using the standard matplotlib library; the code is available via the link in the documentation.
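As a rough illustration of how the plotted data might be inspected, the sketch below loads 'mapplot_data.npz' with NumPy and draws a single field with matplotlib. The array names stored in the file are not documented here, so the key used below is purely hypothetical; list the actual keys first and adapt accordingly.

```python
import numpy as np
import matplotlib.pyplot as plt

# Load the archived data file (path is illustrative).
data = np.load("mapplot_data.npz")
print(list(data.keys()))  # inspect the actual array names first

# 'sea_ice_concentration' is a hypothetical key; replace it with a real one.
field = data["sea_ice_concentration"]

fig, ax = plt.subplots()
im = ax.pcolormesh(field, cmap="Blues_r")
fig.colorbar(im, ax=ax, label="Sea ice concentration (%)")
ax.set_title("Figure 9.15 input field (illustrative)")
plt.show()
```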
Sources of additional information
The following weblinks are provided in the Related Documents section of this catalogue record:
- Link to the figure on the IPCC AR6 website
- Link to the report component containing the figure (Chapter 9)
- Link to the Supplementary Material for Chapter 9, which contains details on the input data used in Table 9.SM.9
- Link to the data and code used to produce this figure and others in Chapter 9, archived on Zenodo
- Link to the code and output data for this figure, contained in a dedicated GitHub repository