Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Replication Package Files
1. Forms.zip: contains the forms used to collect data for the experiment
2. Experiments.zip: contains the participants’ and sandboxers’ experimental task workflow with Newton.
3. Responses.zip: contains the responses collected from participants during the experiments.
4. Analysis.zip: contains the data analysis scripts and results of the experiments.
5. newton.zip: contains the tool we used for the WoZ experiment.
TutorialStudy.pdf: the script followed during the experiment, with and without Newton, to keep the procedure consistent across all participants.
Woz_Script.pdf: the script used by the wizard to keep Newton's responses consistent across participants.
1. Forms.zip
The forms zip contains the following files:
Demographics.pdf: a PDF form used to collect demographic information from participants before the experiments
Post-Task Control (without the tool).pdf: a PDF form used to collect data from participants about challenges and interactions when performing the task without Newton
Post-Task Newton (with the tool).pdf: a PDF form used to collect data from participants after the task with Newton.
Post-Study Questionnaire.pdf: a PDF form used to collect data from the participant after the experiment.
2. Experiments.zip
The experiments zip contains two types of folders:
exp[participant’s number]-c[number of dataset used for control task]e[number of dataset used for experimental task]. Example: exp1-c2e1 (experiment participant 1 - control used dataset 2, experimental used dataset 1)
sandboxing[sandboxer’s number]. Example: sandboxing1 (experiment with sandboxer 1)
Every experiment subfolder contains:
warmup.json: a JSON file with the results of Newton-Participant interactions in the chat for the warmup task.
warmup.ipynb: a Jupyter notebook file with the participant’s results from the code provided by Newton in the warmup task.
sample1.csv: Death Event dataset.
sample2.csv: Heart Disease dataset.
tool.ipynb: a Jupyter notebook file with the participant’s results from the code provided by Newton in the experimental task.
python.ipynb: a Jupyter notebook file with the participant’s results from the code they tried during the control task.
results.json: a JSON file with the results of Newton-Participant interactions in the chat for the task with Newton.
To load an experiment chat log into Newton, add the following code to the notebook (note the file is named results.json, as described above):

import anachat
import json

with open("results.json", "r") as f:
    anachat.comm.COMM.history = json.load(f)

Then, click on the notebook name inside the Newton chat.
Note 1: the subfolder for P6 is exp6-e2c1-serverdied because the experiment server died before we were able to save the logs. We reconstructed them using the notebook newton_remake.ipynb based on the video recording.
Note 2: The sandboxing occurred during the development of Newton. We did not collect all the files, and the format of JSON files is different than the one supported by the attached version of Newton.
3. Responses.zip
The responses zip contains the following files:
demographics.csv: a CSV file containing the responses collected from participants using the demographics form
task_newton.csv: a CSV file containing the responses collected from participants using the post-task newton form.
task_control.csv: a CSV file containing the responses collected from participants using the post-task control form.
post_study.csv: a CSV file containing the responses collected from participants using the post-study questionnaire form.
4. Analysis.zip
The analysis zip contains the following files:
1.Challenge.ipynb: a Jupyter notebook file where the perceptions of challenges figure was created.
2.Interactions.py: a Python file where the participants’ JSON files were created.
3.Interactions.Graph.ipynb: a Jupyter notebook file where the participant’s interaction figure was created.
4.Interactions.Count.ipynb: a Jupyter notebook file that counts participants’ interaction with each figure.
config_interactions.py: this file contains the definitions of interaction colors and grouping
interactions.json: a JSON file with the interactions during the Newton task of each participant based on the categorization.
requirements.txt: dependencies required to run the code to generate the graphs and json analysis.
To run the analyses, install the dependencies on Python 3.10 with the following command, then execute the scripts and notebooks in order:
pip install -r requirements.txt
5. newton.zip
The newton zip contains the source code of the Jupyter Lab extension we used in the experiments. Read the README.md file inside it for instructions on how to install and run it.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the EHRI-3 project, we are investigating tools and methods that historical researchers and scholars can use to better understand, visualise, and interpret the material held by our partner archives. This dataset accompanies a tutorial exploring a technique called topic modelling in the context of a Holocaust-related historical collection.
We were on the lookout for datasets that would be easily accessible and, for convenience, predominantly in English. One such dataset was the United States Holocaust Memorial Museum’s (USHMM) extensive collection of oral history testimonies, for which there are a considerable number of textual transcripts. The museum’s total collection consists of 80,703 testimonies, 41,695 of which are available in English, with 2,894 of them listing a transcript.
Since there is not yet a ready-to-download dataset that includes these transcripts, we had to construct our own. Using a web scraping tool, we managed to create a list of the links pointing to the metadata (including transcripts) of the testimonies that were of interest to us. After obtaining the transcript and other metadata of each of these testimonies, we were able to create our dataset and curate it to remove any unwanted entries. For example, we made sure to remove entries with restrictions on access or use. We also removed entries with transcripts that consisted only of some automatically generated headers and entries which turned out to be in languages other than English. The remaining 1,873 transcripts form the corpus of this tutorial — a small, but still decently sized dataset.
The process that we followed to put together this dataset is detailed in the Jupyter Notebook accompanying this post, which can be found in this Github repository.
In this Zenodo upload, the user can find two files, each of them containing a pickled pandas DataFrame that was obtained at a different stage of the tutorial:
"unrestricted_df.pkl" contains 1,946 entries of Oral Testimony transcripts and has five fields (RG_number, text, display_date, conditions_access, conditions_use)
"unrestricted_lemmatized_df.pkl" contains 1,873 entries of Oral Testimony transcripts and has six fields (RG_number, text, display_date, conditions_access, conditions_use, lemmas)
Instructions on their intended use can be found in the accompanying Jupyter Notebook.
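As a hedged sketch of how such pickled DataFrames are typically loaded with pandas (the file and rows below are fabricated stand-ins with the same five fields, not the actual Zenodo files):

```python
import pandas as pd

# Stand-in DataFrame mimicking the schema of unrestricted_df.pkl;
# the row content is fabricated for illustration only.
df = pd.DataFrame({
    "RG_number": ["RG-50.030.0001"],
    "text": ["Example transcript text ..."],
    "display_date": ["1990"],
    "conditions_access": ["none"],
    "conditions_use": ["none"],
})
df.to_pickle("example_df.pkl")

# Loading works the same way for the files in this upload:
loaded = pd.read_pickle("example_df.pkl")
print(loaded.shape)  # (1, 5)
```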
Credits:
The transcripts that form the corpus in this tutorial were obtained through the United States Holocaust Memorial Museum (USHMM).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the artifacts of our study on how software engineering research papers are shared and interacted with on LinkedIn, a professional social network. This includes:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary dataset and Jupyter notebook for reproduction of the UM data within figures presented in McCulloch et al., 2022.
Data is a post-processed extract of the raw dataset for each variable. Data from the raw dataset has been extracted according to the appropriate Martian month, zonally meaned, and converted to a σ/pressure coordinate system. This process is the same as is applied to the MCD dataset, as can be seen in the Jupyter notebook.
The notebook provides the code needed to reproduce the figures with the given data. All instructions are detailed within the notebook, including package dependencies and configuration options. Due to licensing, we are only able to provide access to the UM post-processed data, for the MCD dataset please follow the instructions within the notebook.
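As a minimal sketch of the zonal-mean step described above (NumPy assumed; the array shape and values are fabricated, not the actual UM data):

```python
import numpy as np

# Fabricated temperature field on a (level, latitude, longitude) grid
rng = np.random.default_rng(42)
temp = rng.normal(200.0, 10.0, size=(5, 36, 72))

# Zonal mean: average over the longitude axis, leaving (level, latitude)
zonal_mean = temp.mean(axis=-1)
print(zonal_mean.shape)  # (5, 36)
```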
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This resource, configured for execution on connected JupyterHub compute platforms using the CyberGIS-Jupyter for Water (CJW) environment's supported High-Performance Computing (HPC) resources (Expanse or Virtual ROGER) through the CyberGIS-Compute Service, helps modelers reproduce and build on the results from the VB study (Van Beusekom et al., 2022), as explained by Maghami et al. (2023).
For this purpose, four Jupyter notebooks are developed and included in this resource. They explore the paper's goal for four example CAMELS sites and a pre-selected 60-month simulation period to demonstrate the capabilities of the notebooks.

The first notebook processes the raw input data from the CAMELS dataset to be used as input for the SUMMA model. The second notebook utilizes the CJW environment's supported HPC resources (Expanse or Virtual ROGER) through the CyberGIS-Compute Service to execute the SUMMA model; it uses the input data from the first notebook with original and altered forcing, as further described in the notebook. The third notebook utilizes the outputs from the second notebook and visualizes the sensitivity of the SUMMA model outputs using the Kling-Gupta Efficiency (KGE). The fourth notebook, developed only for the HPC environment (and currently working only with the Expanse HPC), enables transferring large data from HPC to the scientific cloud service (i.e., CJW) using the Globus service integrated by CyberGIS-Compute in a reliable, high-performance, and fast way.

More information about each Jupyter notebook and step-by-step instructions on how to run the notebooks can be found in the Readme.md file included in this resource. Using these four notebooks, modelers can apply the methodology mentioned above to any of the 671 CAMELS basins (from one to all) and simulation periods of their choice. As this resource uses HPC, it enables high-speed execution of simulations, which makes larger runs (even as large as the entire 671 CAMELS sites and the whole 60-month simulation period used in the paper) practical and much faster than when no HPC is used.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
In many blockchains, e.g., Ethereum and Binance Smart Chain (BSC), the primary representation used for wallet addresses is a hardly memorable 40-digit hexadecimal string. As a result, users often select addresses from their recent transaction history, which enables blockchain address poisoning. The adversary first generates lookalike addresses similar to one with which the victim has previously interacted, and then engages with the victim to “poison” their transaction history. The goal is to have the victim mistakenly send tokens to the lookalike address instead of the intended recipient. We develop a detection system and perform measurements over two years on Ethereum and BSC. We release the detection result dataset, including over 17 million attack attempts on Ethereum and the successful payoff transfers. We also provide a Jupyter notebook explaining 1) how to access the dataset, 2) how to produce descriptive statistics such as the number of poisoning transfers, and 3) how to manually verify the payoff transfers on Etherscan (BSCscan). This dataset will enable other researchers to validate our results as well as conduct further analysis.
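To make the "lookalike" notion concrete, here is an illustrative sketch of a simple prefix/suffix heuristic; this is not the detection system from the study, and the addresses below are fabricated:

```python
def lookalike(a: str, b: str, k: int = 4) -> bool:
    """Heuristic: two addresses look alike if their first and last
    k hex characters match (wallet UIs often display only these)."""
    a = a.lower().removeprefix("0x")
    b = b.lower().removeprefix("0x")
    return a[:k] == b[:k] and a[-k:] == b[-k:]

# Fabricated example addresses
legit  = "0xAb5801a7D398351b8bE11C439e05C5B3259aeC9B"
poison = "0xAb58F1c2D398351b8bE11C439e05C5B3259aeC9B"  # same head and tail
other  = "0x1234567890abcdef1234567890abcdef12345678"

print(lookalike(legit, poison))  # True
print(lookalike(legit, other))   # False
```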
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This is an example dataset recorded using version 1.0 of the open-source-hardware OpenAXES IMU. Please see the GitHub repository for more information on the hardware and firmware, and find the most up-to-date version of this document in the repository.
This dataset was recorded using four OpenAXES IMUs mounted on the segments of a robot arm (UR5 by Universal Robots). The robot arm was programmed to perform a calibration movement, then trace a 2D circle or triangle in the air with its tool center point (TCP), and return to its starting position, at four different speeds from 100 mm/s to 250 mm/s. This results in a total of 8 different scenarios (2 shapes times 4 speeds). The ground truth joint angle and TCP position values were obtained from the robot controller. The calibration movement at the beginning of the measurement allows for calculating the exact orientation of the sensors on the robot arm.
The IMUs were configured to send the raw data from the three gyroscope axes and the six accelerometer axes to a PC via BLE with 16 bit resolution per axis and 100 Hz sample rate. Since no data packets were lost during this process, this dataset allows comparing and tuning different sensor fusion algorithms on the recorded raw data while using the ground truth robot data as a reference.
In order to visualize the results, the quaternion sequences from the IMUs were applied to the individual segments of a 3D model of the robot arm. The end of this kinematic chain represents the TCP of the virtual model, which should ideally move along the same trajectory as the ground truth, barring the accuracy of the IMUs. Since the raw sensor data of these measurements is available, the calibration coefficients can also be applied ex-post.
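To give a feel for what tuning sensor fusion algorithms on the raw data involves, the following hedged sketch integrates fabricated 100 Hz gyroscope samples into a yaw angle; this naive dead-reckoning is the baseline any fusion filter improves on:

```python
dt = 1.0 / 100.0            # 100 Hz sample rate, as in the dataset
gyro_z = [10.0] * 100       # fabricated: constant 10 deg/s for one second

# Dead-reckoned yaw: simple rectangular integration of angular rate.
# In practice this drifts due to gyro bias, which is why fusion with
# accelerometer data (Madgwick, VQF, ...) is needed.
yaw = 0.0
for rate in gyro_z:
    yaw += rate * dt
print(yaw)  # ~10.0 degrees
```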
Since there are 6 joints but only 4 IMUs, some redundancy must be exploited. The redundancy comes from the fact that each IMU has 3 rotational degrees of freedom, but each joint has only one:

q0 and q1 are both derived from the orientation of the "humerus" IMU.
q2 is the difference† between the orientation of the "humerus" and "radius" IMUs.
q3 is the difference between the orientation of the "radius" and "carpus" IMUs.
q4 is the difference between the orientation of the "carpus" and "digitus" IMUs.
q5 does not influence the position of the TCP, only its orientation, so it is ignored in the evaluation.

† The difference is R1 * inv(R0) for two quaternions (or rotations) R0 and R1. The actual code works a bit differently, but this describes the general principle.

The dataset is organized as follows: the raw IMU measurements are in measure_raw-2022-09-15/, one folder per scenario. In those folders, there is one CSV file per IMU. The robot data is in measure_raw-2022-09-15/robot/, one CSV and MAT file per scenario. Videos are in Media and are stored in git lfs.

The file openaxes-example-robot-dataset.ipynb is provided to play around with the data in the dataset and demonstrate how the files are read and interpreted.
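The relative rotation R1 * inv(R0) can be sketched in pure Python; the (w, x, y, z) convention and the example angles here are assumptions for illustration, not the dataset's actual code:

```python
import math

def q_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2)

def q_inv(q):
    """Inverse of a unit quaternion (its conjugate)."""
    w, x, y, z = q
    return (w, -x, -y, -z)

# R0: 90 deg rotation about z, R1: 180 deg rotation about z
R0 = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
R1 = (0.0, 0.0, 0.0, 1.0)

# Their "difference" R1 * inv(R0) should be a 90 deg rotation about z
diff = q_mul(R1, q_inv(R0))
```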
To use the notebook, set up a Python 3 virtual environment and therein install the necessary packages with pip install -r requirements.txt.
In order to view the graphs contained in the ipynb file, you will most likely have to trust the notebook beforehand, using the following command:
jupyter trust openaxes-example-robot-dataset.ipynb
Beware: This notebook is not a comprehensive evaluation and any results and plots shown in the file are not necessarily scientifically sound evidence of anything.
The notebook will store intermediate files in the measure_raw-2022-09-15 directory, like the quaternion files calculated by the different filters, or the files containing the reconstructed TCP positions. All intermediate files should be ignored by the file measure_raw-2022-09-15/.gitignore. The generated intermediate files are also provided in the file measure_raw-2022-09-15.tar.bz2, in case you want to inspect the generated files without running the notebook.
A number of tools are used in the evaluation notebook. Below is a short overview, but not a complete specification. If you need to understand the input and output formats for each tool, please read the code.
calculate-quaternions.py is used in the evaluation notebook to compute different attitude estimation filters like Madgwick or VQF on the raw accelerometer and gyroscope measurements at 100 Hz.

madgwick-filter contains a small C program that applies the original Madgwick filter to a CSV file containing raw measurements and prints the results. It is used by calculate-quaternions.py.

calculate-robot-quaternions.py calculates a CSV file of quaternions equivalent to the IMU quaternions from a CSV file containing the joint angles of the robot.

dsense_vis, mentioned in the notebook, is used to calculate the 3D model of the robot arm from quaternions and determine the mounting orientations of the IMUs on the robot arm. This program will be released at a future date. In the meantime, the output files of dsense_vis are provided in the file measure_raw-2022-09-15.tar.bz2, which contains the complete content of the measure_raw-2022-09-15 directory after executing the whole notebook. Just unpack this archive and merge its contents with the measure_raw-2022-09-15 directory. This allows you to explore the reconstructed TCP files for the filters implemented at the time of publication.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data supports the results presented in the paper "A Comprehensive Study of the Differential Cross Sections for Water-Rare Gas Collisions: Experimental and Theoretical Perspectives". This research encompasses the analysis of the differential cross-section for the excitation of the fundamental ortho and para levels of water molecules by collision with Ne, Ar and Xe. A joint experimental and theoretical study has been undertaken to this end.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains measurements of radio-frequency electromagnetic emissions from a home-built sender module for BB84 quantum key distribution. The goal of these measurements was to evaluate information leakage through this side channel. This dataset supplements our publication and allows reproducing our results together with the source code hosted on GitHub (and also on Zenodo via integration with GitHub). The measurements were performed using a magnetic near-field probe, an amplifier, and an oscilloscope. The dataset contains raw measured data in the file format output by the oscilloscope; use our source code to make use of it. Detailed descriptions of the measurement procedure can be found in our paper and in the metadata JSON files found within the dataset.
Commented list of datasets
This file lists the datasets that were analyzed and reported on in the paper. The datasets in the list refer to directories here. Note that most of the datasets contain additional files with metadata, which detail where and how the measurements were performed. The mentioned Jupyter notebooks refer to the source code repository https://github.com/XQP-Munich/EmissionSecurityQKD (not included in this dataset). Most of those notebooks output JSON files storing results. The processed JSON files are also included in the source code repository.
In the naming of datasets:

Antenna refers to the log-periodic dipole antenna. All datasets that do not contain Antenna in their name are recorded with the magnetic near-field probe.
Rev1 refers to the initial electronics design, while Rev2 refers to the revised electronics design, which contains countermeasures aiming to reduce emissions.
Shielding refers to measurements where the device is enclosed in a metallic shielding and the measurement takes place outside the shielding.
Rotation refers to the orientation of the magnetic near-field probe at the same spatial location.
Datasets collected with near-field probe for Rev1 electronics
Rev1Distance: contains measurements at different distances from the Rev1 electronics, performed above the FPGA. The deep learning attack is analyzed in TEMPEST_ATTACK.ipynb. The amplitude is analyzed in get_raw_data_RMS_amplitude.ipynb.
Rev12D: different locations on a 2D grid at a constant distance from the electronics. The deep learning attack is analyzed in TEMPEST_ATTACK.ipynb.
Rev130meas2.5cm: 30 measurements above the FPGA at a height of 2.5 cm. Used to evaluate how the amount of training data affects neural network performance. The deep learning attack is analyzed in the notebooks TEMPEST_ATTACK*.ipynb. In particular, TEMPEST_ATTACK_VARY_TRAINING_DATA.ipynb is used on this dataset.
Rev1Rotation10deg: contains a measurement for varying orientation of the probe at the same location. This is not mentioned in the paper and is only included for completeness. The deep learning attack is analyzed in the notebooks TEMPEST_ATTACK*.ipynb.
Rev1TEMPESTShieldingFPGA: measurements with and without shielding at 4 cm above the FPGA. Analyzed in the notebooks TEMPEST_ATTACK*.ipynb.
Datasets collected with near-field probe for Rev2 electronics
Rev2Distance: contains measurements at different distances from the Rev2 electronics, performed above the FPGA.
Rev22D and Rev22Dstart_7_0: contain measurements on a 2D grid performed on the revised electronics. The dataset is split into two directories because the measurement procedure crashed in the middle. This split structure was kept in order to maintain consistency with the automatic metadata.
Rev230meas2.5cm: 30 measurements above the FPGA at a height of 2.5 cm. Used to evaluate how the amount of training data affects neural network performance. The deep learning attack is analyzed in the notebooks TEMPEST_ATTACK*.ipynb. In particular, TEMPEST_ATTACK_VARY_TRAINING_DATA.ipynb is used on this dataset.
Other datasets
BackgroundTuesday: background measurement (QKD device not powered at all) performed with the near-field probe on June 21st, 2022.
BackgroundSaturday: background measurement (QKD device not powered at all) performed with the near-field probe on June 11th, 2022.
AntennaSpectra: dataset of spectra directly recorded by the oscilloscope. Used to demonstrate the ability to tell apart, at a distance, the device sending a QKD key (standard operation) from the device being turned on but not sending any key. Analyzed in the notebook Comparing_KeyNokey_Measurements.ipynb.
Rev2ShieldingAntenna: raw amplitude measurements with the log-periodic dipole antenna on Rev2 electronics including the shielding enclosure, collected at various distances. None of our attacks against this scenario were successful. The dataset represents a challenge to test more advanced attacks using improved data processing.
T1DiabetesGranada
A longitudinal multi-modal dataset of type 1 diabetes mellitus
Documented by:
Rodriguez-Leon, C., Aviles-Perez, M. D., Banos, O., Quesada-Charneco, M., Lopez-Ibarra, P. J., Villalonga, C., & Munoz-Torres, M. (2023). T1DiabetesGranada: a longitudinal multi-modal dataset of type 1 diabetes mellitus. Scientific Data, 10(1), 916. https://doi.org/10.1038/s41597-023-02737-4
Background
Type 1 diabetes mellitus (T1D) patients face daily difficulties in keeping their blood glucose levels within appropriate ranges. Several techniques and devices, such as flash glucose meters, have been developed to help T1D patients improve their quality of life. Most recently, the data collected via these devices is being used to train advanced artificial intelligence models to characterize the evolution of the disease and support its management. The main problem for the generation of these models is the scarcity of data, as most published works use private or artificially generated datasets. For this reason, this work presents T1DiabetesGranada, an open (under specific permission) longitudinal dataset that not only provides continuous glucose levels, but also patient demographic and clinical information. The dataset includes 257,780 days of measurements over four years from 736 T1D patients from the province of Granada, Spain. This dataset progresses significantly beyond the state of the art as one of the longest and largest open datasets of continuous glucose measurements, thus boosting the development of new artificial intelligence models for glucose level characterization and prediction.
Data Records
The data are stored in four comma-separated values (CSV) files which are available in T1DiabetesGranada.zip. These files are described in detail below.
Patient_info.csv
Patient_info.csv is the file containing information about the patients, such as demographic data, start and end dates of blood glucose level measurements and biochemical parameters, number of biochemical parameters or number of diagnostics. This file is composed of 736 records, one for each patient in the dataset, and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Sex – Sex of the patient. Values: F (for female), M (for male).
Birth_year – Year of birth of the patient. Format: YYYY.
Initial_measurement_date – Date of the first blood glucose level measurement of the patient in the Glucose_measurements.csv file. Format: YYYY-MM-DD.
Final_measurement_date – Date of the last blood glucose level measurement of the patient in the Glucose_measurements.csv file. Format: YYYY-MM-DD.
Number_of_days_with_measures – Number of days with blood glucose level measurements of the patient, extracted from the Glucose_measurements.csv file. Values: ranging from 8 to 1463.
Number_of_measurements – Number of blood glucose level measurements of the patient, extracted from the Glucose_measurements.csv file. Values: ranging from 400 to 137292.
Initial_biochemical_parameters_date – Date of the first biochemical test to measure some biochemical parameter of the patient, extracted from the Biochemical_parameters.csv file. Format: YYYY-MM-DD.
Final_biochemical_parameters_date – Date of the last biochemical test to measure some biochemical parameter of the patient, extracted from the Biochemical_parameters.csv file. Format: YYYY-MM-DD.
Number_of_biochemical_parameters – Number of biochemical parameters measured on the patient, extracted from the Biochemical_parameters.csv file. Values: ranging from 4 to 846.
Number_of_diagnostics – Number of diagnoses realized to the patient, extracted from the Diagnostics.csv file. Values: ranging from 1 to 24.
Glucose_measurements.csv
Glucose_measurements.csv is the file containing the continuous blood glucose level measurements of the patients. The file is composed of more than 22.6 million records that constitute the time series of continuous blood glucose level measurements. It includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Measurement_date – Date of the blood glucose level measurement. Format: YYYY-MM-DD.
Measurement_time – Time of the blood glucose level measurement. Format: HH:MM:SS.
Measurement – Value of the blood glucose level measurement in mg/dL. Values: ranging from 40 to 500.
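A hedged sketch of reading this layout with pandas (the rows below are fabricated, not real patient data):

```python
import io
import pandas as pd

# Fabricated rows in the documented Glucose_measurements.csv layout
csv_text = (
    "Patient_ID,Measurement_date,Measurement_time,Measurement\n"
    "LIB190001,2020-05-01,08:00:00,110\n"
    "LIB190001,2020-05-01,08:15:00,123\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Combine date and time into a single timestamp for time-series work
df["timestamp"] = pd.to_datetime(df["Measurement_date"] + " " + df["Measurement_time"])
print(df["Measurement"].mean())  # 116.5
```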
Biochemical_parameters.csv
Biochemical_parameters.csv is the file containing data of the biochemical tests performed on patients to measure their biochemical parameters. This file is composed of 87482 records and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Reception_date – Date of receipt in the laboratory of the sample to measure the biochemical parameter. Format: YYYY-MM-DD.
Name – Name of the measured biochemical parameter. Values: 'Potassium', 'HDL cholesterol', 'Gammaglutamyl Transferase (GGT)', 'Creatinine', 'Glucose', 'Uric acid', 'Triglycerides', 'Alanine transaminase (GPT)', 'Chlorine', 'Thyrotropin (TSH)', 'Sodium', 'Glycated hemoglobin (Ac)', 'Total cholesterol', 'Albumin (urine)', 'Creatinine (urine)', 'Insulin', 'IA ANTIBODIES'.
Value – Value of the biochemical parameter. Values: ranging from -4.0 to 6446.74.
Diagnostics.csv
Diagnostics.csv is the file containing diagnoses of diabetes mellitus complications or other diseases that patients have in addition to type 1 diabetes mellitus. This file is composed of 1757 records and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Code – ICD-9-CM diagnosis code. Values: subset of 594 of the ICD-9-CM codes (https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes).
Description – ICD-9-CM long description. Values: subset of 594 of the ICD-9-CM long description (https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes).
Technical Validation
Blood glucose level measurements are collected using FreeStyle Libre devices, which are widely used for healthcare in patients with T1D. Abbott Diabetes Care, Inc., Alameda, CA, USA, the manufacturer company, has conducted validation studies of these devices concluding that the measurements made by their sensors compare to YSI analyzer devices (Xylem Inc.), the gold standard, yielding results of 99.9% of the time within zones A and B of the consensus error grid. In addition, other studies external to the company concluded that the accuracy of the measurements is adequate.
Moreover, it was also checked that, in most cases, the blood glucose level measurements per patient were continuous (i.e., a sample at least every 15 minutes) in the Glucose_measurements.csv file, as they should be.
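This continuity property can be checked, for example, as follows (pandas assumed; the timestamps are fabricated):

```python
import pandas as pd

# Fabricated timestamps for one patient; "continuous" here means
# consecutive samples are at most 15 minutes apart
ts = pd.Series(pd.to_datetime([
    "2020-05-01 08:00:00",
    "2020-05-01 08:15:00",
    "2020-05-01 08:30:00",
]))
gaps = ts.diff().dropna()
continuous = bool((gaps <= pd.Timedelta(minutes=15)).all())
print(continuous)  # True
```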
Usage Notes
For data downloading, it is necessary to be authenticated on the Zenodo platform, accept the Data Usage Agreement and send a request specifying full name, email, and the justification of the data use. This request will be processed by the Secretary of the Department of Computer Engineering, Automatics, and Robotics of the University of Granada and access to the dataset will be granted.
The files that compose the dataset are comma-delimited CSV files available in T1DiabetesGranada.zip. A Jupyter Notebook (Python v. 3.8) with code that may help to better understand the dataset, with graphics and statistics, is available in UsageNotes.zip.
Graphs_and_stats.ipynb
The Jupyter Notebook generates tables, graphs and statistics for a better understanding of the dataset. It has four main sections, one dedicated to each file in the dataset. In addition, it has useful functions such as calculating the patient age, deleting a patient list from a dataset file and leaving only a patient list in a dataset file.
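As a hedged sketch of the kind of helper functions mentioned (pandas assumed; the column names follow Patient_info.csv, the rows are fabricated, and the reference year is an assumption):

```python
import pandas as pd

# Fabricated Patient_info-style rows
info = pd.DataFrame({
    "Patient_ID": ["LIB190001", "LIB190002"],
    "Birth_year": [1985, 2001],
})

# Patient age relative to an assumed reference year
REFERENCE_YEAR = 2020
info["Age"] = REFERENCE_YEAR - info["Birth_year"]

# Keep only a given patient list (the inverse drops that list instead)
keep = ["LIB190002"]
subset = info[info["Patient_ID"].isin(keep)]
print(subset["Age"].tolist())  # [19]
```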
Code Availability
The dataset was generated using custom code located in CodeAvailability.zip. The code is provided as Jupyter Notebooks created with Python v. 3.8. The code was used to conduct tasks such as data curation, transformation, and variable extraction.
Original_patient_info_curation.ipynb
This Jupyter Notebook preprocesses the original file with patient data. Mainly, irrelevant rows and columns are removed, and the sex variable is recoded.
Glucose_measurements_curation.ipynb
This Jupyter Notebook preprocesses the original file with the continuous glucose level measurements of the patients. Principally, rows without information and duplicated rows are removed, and the variable with the timestamp is split into two new variables: measurement date and measurement time.
Biochemical_parameters_curation.ipynb
This Jupyter Notebook preprocesses the original file with data on the biochemical tests performed on patients to measure their biochemical parameters. Mainly, irrelevant rows and columns are removed, and the variable with the name of the measured biochemical parameter is translated.
Diagnostic_curation.ipynb
This Jupyter Notebook preprocesses the original file with data on the diagnoses of diabetes mellitus complications or other diseases that patients have in addition to T1D.
Get_patient_info_variables.ipynb
This Jupyter Notebook implements the feature extraction process from the files Glucose_measurements.csv, Biochemical_parameters.csv and Diagnostics.csv to complete the file Patient_info.csv. It is divided into six sections: the first three extract the features from each of the mentioned files, and the next three add the extracted features to the resulting new file.
Data Usage Agreement
The conditions for use are as follows:
You confirm that you will not attempt to re-identify research participants for any reason, including for re-identification theory research.
You commit to keeping the T1DiabetesGranada dataset confidential and secure and will not redistribute data or Zenodo account credentials.
You will require
This HydroShare resource provides Jupyter Notebooks with instructions and code for accessing and subsetting the NOAA Analysis of Record for Calibration (AORC) Dataset. There are two Jupyter Notebooks:
1. AORC_Point_Data_Retrieval.ipynb
2. AORC_Zone_Data_Retrieval.ipynb
The first retrieves data for a point in the area of the US covered, specified using geographic coordinates. The second retrieves data for areas specified via an uploaded polygon shapefile.
These notebooks programmatically retrieve the data from Amazon Web Services (https://registry.opendata.aws/noaa-nws-aorc/) and, in the case of shapefile-based retrieval, average the data over the shapes in the given shapefile.
The notebooks provided are coded to retrieve data from AORC version 1.1 released in ZARR format in December 2023.
The Analysis Of Record for Calibration (AORC) is a gridded record of near-surface weather conditions covering the continental United States and Alaska and their hydrologically contributing areas (https://registry.opendata.aws/noaa-nws-aorc/). It is defined on a latitude/longitude spatial grid with a mesh length of 30 arc seconds (~800 m), and a temporal resolution of one hour. Elements include hourly total precipitation, temperature, specific humidity, terrain-level pressure, downward longwave and shortwave radiation, and west-east and south-north wind components. It spans the period from 1979 across the Continental U.S. (CONUS) and from 1981 across Alaska, to the near-present (at all locations). This suite of eight variables is sufficient to drive most land-surface and hydrologic models and is used as input to the National Water Model (NWM) retrospective simulation. While the original NOAA process generated AORC data in netCDF format, the data has been post-processed to create a cloud optimized Zarr formatted equivalent that NOAA also disseminates.
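As a rough illustration of what point retrieval on such a grid involves, the sketch below maps a geographic coordinate to the nearest cell index on a 30-arc-second mesh. The grid origin used here is an assumption for illustration only; the actual notebooks select points from the Zarr store's own coordinate arrays (e.g. with xarray's nearest-neighbour selection).

```python
# Nearest-grid-cell lookup for a 30-arc-second (1/120 degree) mesh like AORC's.
# The origin below is an illustrative assumption, not the AORC coordinate layout.
CELL = 30.0 / 3600.0  # 30 arc seconds expressed in degrees

def nearest_cell_index(coord, origin):
    """Index of the grid cell whose centre is nearest to `coord` (degrees)."""
    return round((coord - origin) / CELL)

# Example: a point 0.25 degrees east of an assumed western grid edge
i = nearest_cell_index(-119.75, origin=-120.0)
```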
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Codes for single-cell RNA-sequencing analysis of data from mouse oviducts. WT and cKO data were collected from Pgrf/f (WT) or Wnt7aCre/+;Pgrf/f (cKO) mice at 0.5 days post coitus (dpc). Another dataset was collected from C57BL6/J adult mice that were ovariectomized and treated with progesterone (P4) at 1 mg/mouse for 2 or 24 hours, compared to vehicle control. The oviducts were dissected into infundibulum+ampulla (InfAmp or IA) and isthmus+UTJ (IsthUTJ or IU) regions.
This is a detailed description of the dataset, a data sheet for the dataset as proposed by Gebru et al.
Motivation for Dataset Creation Why was the dataset created? Embrapa ADD 256 (Apples by Drones Detection Dataset — 256 × 256) was created to provide images and annotation for research on apple detection in orchards for UAV-based monitoring in apple production.
What (other) tasks could the dataset be used for? Apple detection in low-resolution scenarios, similar to the aerial images employed here.
Who funded the creation of the dataset? The building of the ADD256 dataset was supported by the Embrapa SEG Project 01.14.09.001.05.04, Image-based metrology for Precision Agriculture and Phenotyping, and FAPESP under grant (2017/19282-7).
Dataset Composition What are the instances? Each instance consists of an RGB image and an annotation describing apples locations as circular markers (i.e., presenting center and radius).
How many instances of each type are there? The dataset consists of 1,139 images containing 2,471 apples.
What data does each instance consist of? Each instance contains an 8-bit RGB image. Its corresponding annotation is found in the JSON files: each apple marker is composed of its center (cx, cy) and its radius (in pixels), as seen below:
"gebler-003-06.jpg": [ { "cx": 116, "cy": 117, "r": 10 }, { "cx": 134, "cy": 113, "r": 10 }, { "cx": 221, "cy": 95, "r": 11 }, { "cx": 206, "cy": 61, "r": 11 }, { "cx": 92, "cy": 1, "r": 10 } ],
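For illustration, such an annotation entry can be read with the standard json module. The snippet below parses the example above (embedded as a string in place of an actual annotation file) and counts the apples per image.

```python
import json

# The example annotation entry from above, embedded as a JSON string
annotations = json.loads("""
{"gebler-003-06.jpg": [
    {"cx": 116, "cy": 117, "r": 10},
    {"cx": 134, "cy": 113, "r": 10},
    {"cx": 221, "cy": 95, "r": 11},
    {"cx": 206, "cy": 61, "r": 11},
    {"cx": 92, "cy": 1, "r": 10}
]}
""")

# One circular marker per apple, so counting markers counts apples
apples_per_image = {name: len(markers) for name, markers in annotations.items()}
```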
Dataset.ipynb is a Jupyter Notebook presenting a code example for reading the data as a PyTorch Dataset (it should be straightforward to adapt the code for other frameworks such as Keras/TensorFlow, fastai/PyTorch, Scikit-learn, etc.).
Is everything included or does the data rely on external resources? Everything is included in the dataset.
Are there recommended data splits or evaluation measures? The dataset comes with specified train/test splits. The splits are found in lists stored as JSON files.
| | Number of images | Number of annotated apples |
| --- | --- | --- |
| Training | 1,025 | 2,204 |
| Test | 114 | 267 |
| Total | 1,139 | 2,471 |
Dataset recommended split.
Standard measures from the information retrieval and computer vision literature should be employed: precision and recall, F1-score, and average precision, as seen in COCO and Pascal VOC.
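As a minimal sketch, once predicted markers have been matched to annotated apples, precision, recall and F1-score follow directly from the counts of true positives (TP), false positives (FP) and false negatives (FN). The counts used below are made up for illustration.

```python
# Detection metrics from matching counts: TP = correct detections,
# FP = spurious detections, FN = annotated apples that were missed.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: 220 of 267 annotated apples found, 30 spurious detections
p, r, f1 = precision_recall_f1(tp=220, fp=30, fn=47)
```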
What experiments were initially run on this dataset? The first experiments run on this dataset are described in A methodology for detection and location of fruits in apples orchards from aerial images by Santos & Gebler (2021).
Data Collection Process How was the data collected? The data employed in the development of the methodology came from two plots located at Embrapa's Temperate Climate Fruit Growing Experimental Station at Vacaria-RS (28°30’58.2”S, 50°52’52.2”W). Plants of the varieties Fuji and Gala are present in the dataset in equal proportions. The images were taken on December 13, 2018, by a UAV (DJI Phantom 4 Pro) that flew over the rows of the field at a height of 12 m. The images mix nadir and non-nadir views, allowing a more extensive view of the canopies. A subset of the images was randomly selected, and 256 × 256 pixel patches were extracted.
Who was involved in the data collection process? T. T. Santos and L. Gebler captured the images in the field. T. T. Santos performed the annotation.
How was the data associated with each instance acquired? The circular markers were annotated using the VGG Image Annotator (VIA).
WARNING: Finding non-ripe apples in low-resolution images of orchards is a challenging task even for humans. ADD256 was annotated by a single annotator, so users of this dataset should consider it a noisy dataset.
Data Preprocessing What preprocessing/cleaning was done? No preprocessing was applied.
Dataset Distribution How is the dataset distributed? The dataset is available at GitHub.
When will the dataset be released/first distributed? The dataset was released in October 2021.
What license (if any) is it distributed under? The data is released under Creative Commons BY-NC 4.0 (Attribution-NonCommercial 4.0 International license). There is a request to cite the corresponding paper if the dataset is used. For commercial use, contact Embrapa Agricultural Informatics business office.
Are there any fees or access/export restrictions? There are no fees or restrictions. For commercial use, contact Embrapa Agricultural Informatics business office.
Dataset Maintenance Who is supporting/hosting/maintaining the dataset? The dataset is hosted at Embrapa Agricultural Informatics and all comments or requests can be sent to Thiago T. Santos (maintainer).
Will the dataset be updated? There are no scheduled updates.
If others want to extend/augment/build on this dataset, is there a mechanism for them to do so? Contributors should contact the maintainer by e-mail.
No warranty The maintainers and their institutions are exempt from any liability, judicial or extrajudicial, for any losses or damages arising from the use of the data contained in the image database.
This dataset was collected by running the High-latitude Ionosphere Dynamics for Research Applications (HIDRA) model, which is a significant rewrite of the Ionosphere/Polar Wind Model (IPWM) [Varney et al., 2014; Varney et al., 2015; Varney et al., 2016] and is designed as a component of the Multiscale Atmosphere-Geospace Environment (MAGE) framework under development by the Center for Geospace Storms NASA DRIVE Science Center. This dataset was processed using Jupyter Notebook scripts for analysis and visualization.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The COVID_CO2_ferries dataset stems from variables of the THETIS-MRV and IHS Markit datasets. They are described in the manuscript referenced below in "Additional Resources".
Meaning of the variables in the COVID_CO2_ferries dataset:
| number | name | meaning | units |
| --- | --- | --- | --- |
| 1 | IMOn | IMO number (vessel unique identifier) | - |
| 2 | Eber | Per-ship CO2 emissions at berth | ton |
| 3 | Etot | Per-ship total CO2 emissions | ton |
| 4 | Dom | Sea basin (NOR, BAL, MED) | - |
| 5 | COVID | Dummy variable (true in 2020) | - |
| 6 | year | Year of CO2 emissions | - |
| 7 | Pme | Total power of main engines | 0/1 |
| 8 | LOA | Length over all | 0/1 |
| 9 | nPax | Passenger carrying capacity | 0/1 |
| 10 | yearB | Year of building | 0/1 |
| 11 | nCalls | Per-ship number of port calls | - |
| 12 | VType | Vessel type (see defining Eq. below) | - |
The variables with binary values (0/1 in the "units" column) refer to below (0) or above (1) the thresholds defined by:
| \(k\) | \(\varphi_k\) | \(\varphi_{k0}\) | units |
| --- | --- | --- | --- |
| 0 | Pme | 21,600 | kW |
| 1 | nPax | 1,250 | - |
| 2 | LOA | 174 | m |
| 3 | yearB | 1999 | - |
The VType variable is defined by:
\(\texttt{VType} = \sum_{k=0}^{3} 2^{k} \cdot H(\varphi_{k} - \varphi_{k0})\)
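A sketch of this encoding in Python: each exceeded threshold sets one bit of VType. Whether a value exactly at a threshold counts as above or below is an assumption here (it depends on the Heaviside convention adopted), and the example ship is hypothetical.

```python
# Threshold values phi_k0 from the table above, in bit order k = 0..3
THRESHOLDS = [("Pme", 21_600), ("nPax", 1_250), ("LOA", 174), ("yearB", 1999)]

def vtype(ship):
    """VType = sum over k of 2^k * H(phi_k - phi_k0); strict '>' assumed."""
    return sum(2 ** k * (1 if ship[name] > phi0 else 0)
               for k, (name, phi0) in enumerate(THRESHOLDS))

# Hypothetical ship: bits 0 (Pme), 2 (LOA) and 3 (yearB) are set
ship = {"Pme": 25_000, "nPax": 1_000, "LOA": 180, "yearB": 2005}
v = vtype(ship)  # 1 + 4 + 8 = 13
```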
Additional Resources
The manuscript "How COVID-19 affected GHG emissions of ferries in Europe" by Mannarini et al. (2022), which uses this dataset, was published in Sustainability 2022, 14(9), 5287; https://doi.org/10.3390/su14095287. You may want to cite it.
A Jupyter notebook using this dataset is available at https://github.com/hybrs/COVID-CO2-ferries
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).
The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.
Its novelty and our main contributions lie in (1) the focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research of multimodal approaches; (3) the mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.
The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).
The accompanying GitHub repository provides a small static sample of the dataset and the dataset's descriptive analysis in the form of Jupyter notebooks.
Options to access the dataset
There are two ways to get access to the dataset:
In order to obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.
References
If you use this dataset in any publication, project, tool or in any other form, please cite the following papers:
@inproceedings{SrbaMonantPlatform, author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria}, booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)}, pages = {1--7}, title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior}, year = {2019} }
@inproceedings{SrbaMonantMedicalDataset, author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria}, booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)}, numpages = {11}, title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims}, year = {2022}, doi = {10.1145/3477495.3531726}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3477495.3531726}, }
Dataset creation process
In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, Wordpress sites, Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.
Ethical considerations
The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.
The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.
As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.
Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.
Reporting mistakes in the dataset
The means to report considerable mistakes in raw collected data or in manual annotations is by creating a new issue in the accompanying GitHub repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.
Dataset structure
Raw data
First, the dataset contains so-called raw data (i.e., data extracted by the Web monitoring module of the Monant platform and stored in exactly the same form as they appear at the original websites). Raw data consist of articles from news sites and blogs (e.g., naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g., snopes.com). In addition, the dataset contains feedback (number of likes, shares, comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.
Raw data are contained in these CSV files (and corresponding REST API endpoints):
sources.csv
articles.csv
article_media.csv
article_authors.csv
discussion_posts.csv
discussion_post_authors.csv
fact_checking_articles.csv
fact_checking_article_media.csv
claims.csv
feedback_facebook.csv
Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.
Annotations
Second, the dataset contains so-called annotations. Entity annotations describe individual raw-data entities (e.g., article, source). Relation annotations describe relations between two such entities.
Each annotation is described by the following attributes:
category of annotation (annotation_category). Possible values: label (annotation corresponds to ground truth, determined by human experts) and prediction (annotation was created by means of an AI method).
type of annotation (annotation_type_id). Example values: Source reliability (binary), Claim presence. The list of possible values can be obtained from the enumeration in annotation_types.csv.
method which created the annotation (method_id). Example values: Expert-based source reliability evaluation, Fact-checking article to claim transformation method. The list of possible values can be obtained from the enumeration in methods.csv.
its value (value). The value is stored in JSON format and its structure differs according to the particular annotation type.
At the same time, annotations are associated with a particular object identified by:
entity type (parameter entity_type in case of entity annotations, or source_entity_type and target_entity_type in case of relation annotations). Possible values: sources, articles, fact-checking-articles.
entity id (parameter entity_id in case of entity annotations, or source_entity_id and target_entity_id in case of relation annotations).
The dataset provides specifically these entity annotations:
Source reliability (binary). Determines the validity of a source (website) on a binary scale with two options: reliable source and unreliable source.
Article veracity. Aggregated information about veracity from article-claim pairs.
The dataset provides specifically these relation annotations:
Fact-checking article to claim mapping. Determines mapping between fact-checking article and claim.
Claim presence. Determines presence of claim in article.
Claim stance. Determines stance of an article to a claim.
Annotations are contained in these CSV files (and corresponding REST API endpoints):
entity_annotations.csv
relation_annotations.csv
Note: Identification of human annotators (the email provided in the annotation app) is anonymised.
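As an illustrative sketch of working with these files, the JSON-encoded value column can be decoded per row with the standard csv and json modules. The column names and the sample row below are assumptions for illustration, not the dataset's exact schema.

```python
import csv
import io
import json

# Illustrative sample of an annotations CSV; column names and the row
# are assumptions, not the dataset's exact schema.
sample = io.StringIO(
    'entity_type,entity_id,annotation_type_id,annotation_category,value\n'
    'sources,42,source-reliability,label,"{""value"": ""reliable""}"\n'
)

rows = []
for row in csv.DictReader(sample):
    row["value"] = json.loads(row["value"])  # decode the JSON payload
    rows.append(row)
```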
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MCCN project aims to deliver tools that assist the agricultural sector in understanding crop-environment relationships, specifically by facilitating the generation of data cubes for spatiotemporal data. This repository contains Jupyter notebooks that demonstrate the functionality of the MCCN data cube components. The dataset contains input files for the case study (source_data), RO-Crate metadata (ro-crate-metadata.json), results from the case study (results), and a Jupyter Notebook (MCCN-CASE 3.ipynb).
Research Activity Identifier (RAiD): https://doi.org/10.26292/8679d473
Case Studies
This repository contains code and sample data for the following case studies. Note that the analyses here are to demonstrate the software; the results should not be considered scientifically or statistically meaningful. No effort has been made to address bias in samples, and sample data may not be available at sufficient density to warrant analysis. All case studies end with the generation of an RO-Crate data package including the source data, the notebook and generated outputs, including NetCDF exports of the data cubes themselves.
Case Study 3 - Select optimal survey locality
Given a set of existing survey locations across a variable landscape, determine the optimal site to add to increase the range of surveyed environments.
This study demonstrates: 1) loading heterogeneous data sources into a cube, and 2) analysis and visualisation using numpy and matplotlib.
Data Sources
The primary goal for this case study is to demonstrate importing a set of environmental values for different sites and then using these to identify a subset that maximises spread across the various environmental dimensions. This is a simple implementation that uses four environmental attributes imported for all of Australia (or a subset like NSW) at a moderate grid scale:
Digital soil maps for key soil properties over New South Wales, version 2.0 - SEED - see https://esoil.io/TERNLandscapes/Public/Pages/SLGA/ProductDetails-SoilAttributes.html
ANUCLIM Annual Mean Rainfall raster layer - SEED - see https://datasets.seed.nsw.gov.au/dataset/anuclim-annual-mean-rainfall-raster-layer
ANUCLIM Annual Mean Temperature raster layer - SEED - see https://datasets.seed.nsw.gov.au/dataset/anuclim-annual-mean-temperature-raster-layer
Dependencies
This notebook requires Python 3.10 or higher. Install the relevant Python libraries with: pip install mccn-engine rocrate. Installing mccn-engine will install the other dependencies.
Overview
1. Generate STAC metadata for layers from a predefined configuration
2. Load the data cube and exclude nodata values
3. Scale all variables to a 0.0-1.0 range
4. Select four layers for comparison (soil organic carbon 0-30 cm, soil pH 0-30 cm, mean annual rainfall, mean annual temperature)
5. Select 10 random points within NSW
6. Generate 10 new layers representing the standardised environmental distance between one of the selected points and all other points in NSW
7. For every point in NSW, find the lowest environmental distance to any of the selected points
8. Select the point in NSW that has the highest value for the lowest environmental distance to any selected point - this is the most different point
9. Clean up and save the results to an RO-Crate
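The maximin selection described in the overview can be sketched as follows, assuming sites are represented as tuples of environmental attributes already scaled to 0.0-1.0 (the values below are hypothetical):

```python
import math

def env_distance(a, b):
    """Standardised environmental (Euclidean) distance between two sites."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_different_site(candidates, selected):
    """Candidate whose *nearest* selected site is farthest away (maximin)."""
    return max(candidates,
               key=lambda c: min(env_distance(c, s) for s in selected))

# Hypothetical sites: (soil organic carbon, soil pH, rainfall, temperature), scaled 0-1
selected = [(0.1, 0.2, 0.3, 0.4), (0.9, 0.8, 0.7, 0.6)]
candidates = [(0.5, 0.5, 0.5, 0.5), (0.0, 0.0, 1.0, 1.0), (0.2, 0.2, 0.3, 0.5)]
best = most_different_site(candidates, selected)
```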
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the two semantically enriched trajectory datasets introduced in the CIKM Resource Paper "A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions", by Chiara Pugliese (CNR-IIT), Francesco Lettich (CNR-ISTI), Guido Rocchietti (CNR-ISTI), Chiara Renso (CNR-ISTI), and Fabio Pinelli (IMT Lucca, CNR-ISTI).
The two datasets were generated with an open source pipeline based on the Jupyter notebooks published in the GitHub repository behind our resource paper, and our MAT-Builder system. Overall, our pipeline first generates the files that we provide in the [paris|nyc]_input_matbuilder.zip archives; the files are then passed as input to the MAT-Builder system, which ultimately generates the two semantically enriched trajectory datasets for Paris and New York City, both in tabular and RDF formats. For more details on the input and output data, please see the sections below.
The [paris|nyc]_input_matbuilder.zip archives contain the data sources we used with the MAT-Builder system to semantically enrich raw preprocessed trajectories. More specifically, the archives contain the following files:
The [paris|nyc]_output_tabular.zip zip archives contain the output files generated by MAT-Builder that express the semantically enriched Paris and New York City datasets in tabular format. More specifically, they contain the following files:
There is then a second set of columns which represents the characteristics of the POI that has been associated with a stop. The relevant ones are:
- t2m: 2-meter temperature (2m_temperature, Celsius degrees)
- ssrd: Surface solar radiation (surface_solar_radiation_downwards, Watt per square meter)
- ssrdc: Surface solar radiation clear-sky (surface_solar_radiation_downward_clear_sky, Watt per square meter)
- ro: Runoff (runoff, millimeters)
There is also a set of derived variables:
- ws10: Wind speed at 10 meters (derived from 10m_u_component_of_wind and 10m_v_component_of_wind, meters per second)
- ws100: Wind speed at 100 meters (derived from 100m_u_component_of_wind and 100m_v_component_of_wind, meters per second)
- CS: Clear-Sky index (the ratio between the solar radiation and the solar radiation clear-sky)
- HDD/CDD: Heating/Cooling Degree Days (derived from 2-meter temperature following the EUROSTAT definition)
For each variable we have 367 440 hourly samples (from 01-01-1980 00:00:00 to 31-12-2021 23:00:00) for 34/115/309 regions (NUTS 0/1/2).
The data is provided in two formats:
- NetCDF version 4 (all the variables hourly and CDD/HDD daily). NOTE: the variables are stored as int16 type using a scale_factor to minimise the size of the files.
- Comma Separated Value ("single index" format for all the variables and time frequencies, and "stacked" only for daily and monthly). All the CSV files for each variable are stored in a zipped file.
Methodology
The time-series have been generated using the following workflow:
1. The NetCDF files are downloaded from the Copernicus Data Store from the "ERA5 hourly data on single levels from 1979 to present" dataset.
2. The data is read in R with the climate4r packages and aggregated using the function get_ts_from_shp from panas. All the variables are aggregated at the NUTS boundaries using the average, except for the runoff, which consists of the sum of all the grid points within the regional/national borders.
3. The derived variables (wind speed, CDD/HDD, clear-sky) are computed and all the CSV files are generated using R.
4. The NetCDF files are created using xarray in Python 3.8.
Example notebooks
In the folder notebooks on the associated GitHub repository there are two Jupyter notebooks which show how to deal effectively with the NetCDF data in xarray and how to visualise them in several ways using matplotlib or the enlopy package. There are currently two notebooks:
- exploring-ERA-NUTS: shows how to open the NetCDF files (with Dask) and how to manipulate and visualise them.
- ERA-NUTS-explore-with-widget: explores the datasets interactively with Jupyter and ipywidgets.
The notebook exploring-ERA-NUTS is also available rendered as HTML.
Additional files
In the folder additional files on the associated GitHub repository there is a map showing the spatial resolution of the ERA5 reanalysis and a CSV file specifying the number of grid points with respect to each NUTS0/1/2 region.
License
This dataset is released under the CC-BY-4.0 license.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Replication Package Files
1. Forms.zip: contains the forms used to collect data for the experiment
2. Experiments.zip: contains the participants’ and sandboxers’ experimental task workflow with Newton.
3. Responses.zip: contains the responses collected from participants during the experiments.
4. Analysis.zip: contains the data analysis scripts and results of the experiments.
5. newton.zip: contains the tool we used for the WoZ experiment.
TutorialStudy.pdf: the script used in the experiments with and without Newton, to keep the procedure consistent across all participants.
Woz_Script.pdf: the script used by the wizard to keep Newton's responses consistent across participants.
1. Forms.zip
The forms zip contains the following files:
Demographics.pdf: a PDF form used to collect demographic information from participants before the experiments.
Post-Task Control (without the tool).pdf: a PDF form used to collect data from participants about challenges and interactions when performing the task without Newton.
Post-Task Newton (with the tool).pdf: a PDF form used to collect data from participants after the task with Newton.
Post-Study Questionnaire.pdf: a PDF form used to collect data from the participant after the experiment.
2. Experiments.zip
The experiments zip contains two types of folders:
exp[participant’s number]-c[number of dataset used for control task]e[number of dataset used for experimental task]. Example: exp1-c2e1 (experiment participant 1 - control used dataset 2, experimental used dataset 1)
sandboxing[sandboxer’s number]. Example: sandboxing1 (experiment with sandboxer 1)
Every experiment subfolder contains:
warmup.json: a JSON file with the results of Newton-Participant interactions in the chat for the warmup task.
warmup.ipynb: a Jupyter notebook file with the participant’s results from the code provided by Newton in the warmup task.
sample1.csv: Death Event dataset.
sample2.csv: Heart Disease dataset.
tool.ipynb: a Jupyter notebook file with the participant’s results from the code provided by Newton in the experimental task.
python.ipynb: a Jupyter notebook file with the participant’s results from the code they tried during the control task.
results.json: a JSON file with the results of Newton-Participant interactions in the chat for the task with Newton.
To load an experiment chat log into Newton, add the following code to the notebook:
import anachat
import json

with open("results.json", "r") as f:
    anachat.comm.COMM.history = json.load(f)
Then, click on the notebook name inside the Newton chat.
Note 1: the subfolder for P6 is exp6-e2c1-serverdied because the experiment server died before we were able to save the logs. We reconstructed them using the notebook newton_remake.ipynb based on the video recording.
Note 2: The sandboxing occurred during the development of Newton. We did not collect all the files, and the format of the JSON files is different from the one supported by the attached version of Newton.
3. Responses.zip
The responses zip contains the following files:
demographics.csv: a CSV file containing the responses collected from participants using the demographics form
task_newton.csv: a CSV file containing the responses collected from participants using the post-task newton form.
task_control.csv: a CSV file containing the responses collected from participants using the post-task control form.
post_study.csv: a CSV file containing the responses collected from participants using the post-study questionnaire form.
4. Analysis.zip
The analysis zip contains the following files:
1.Challenge.ipynb: a Jupyter notebook file where the perceptions of challenges figure was created.
2.Interactions.py: a Python file where the participants’ JSON files were created.
3.Interactions.Graph.ipynb: a Jupyter notebook file where the participant’s interaction figure was created.
4.Interactions.Count.ipynb: a Jupyter notebook file that counts participants’ interactions with each figure.
config_interactions.py: this file contains the definitions of interaction colors and grouping.
interactions.json: a JSON file with the interactions during the Newton task of each participant based on the categorization.
requirements.txt: dependencies required to run the code to generate the graphs and json analysis.
To run the analyses, install the dependencies on Python 3.10 with the following command, then execute the scripts and notebooks in order:
pip install -r requirements.txt
5. newton.zip
The newton zip contains the source code of the Jupyter Lab extension we used in the experiments. Read the README.md file inside it for instructions on how to install and run it.