Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Replication Package Files
1. Forms.zip: contains the forms used to collect data for the experiment
2. Experiments.zip: contains the participants’ and sandboxers’ experimental task workflow with Newton.
3. Responses.zip: contains the responses collected from participants during the experiments.
4. Analysis.zip: contains the data analysis scripts and results of the experiments.
5. newton.zip: contains the tool we used for the WoZ experiment.
TutorialStudy.pdf: the script followed during the experiment, with and without Newton, to keep the procedure consistent across all participants.
Woz_Script.pdf: the script used by the wizard to keep Newton's responses consistent across participants.
1. Forms.zip
The forms zip contains the following files:
Demographics.pdf: a PDF form used to collect demographic information from participants before the experiments
Post-Task Control (without the tool).pdf: a PDF form used to collect data from participants about challenges and interactions when performing the task without Newton
Post-Task Newton (with the tool).pdf: a PDF form used to collect data from participants after the task with Newton.
Post-Study Questionnaire.pdf: a PDF form used to collect data from the participant after the experiment.
2. Experiments.zip
The experiments zip contains two types of folders:
exp[participant’s number]-c[number of dataset used for control task]e[number of dataset used for experimental task]. Example: exp1-c2e1 (experiment participant 1 - control used dataset 2, experimental used dataset 1)
sandboxing[sandboxer’s number]. Example: sandboxing1 (experiment with sandboxer 1)
Every experiment subfolder contains:
warmup.json: a JSON file with the results of Newton-Participant interactions in the chat for the warmup task.
warmup.ipynb: a Jupyter notebook file with the participant’s results from the code provided by Newton in the warmup task.
sample1.csv: Death Event dataset.
sample2.csv: Heart Disease dataset.
tool.ipynb: a Jupyter notebook file with the participant’s results from the code provided by Newton in the experimental task.
python.ipynb: a Jupyter notebook file with the participant’s results from the code they tried during the control task.
results.json: a JSON file with the results of Newton-Participant interactions in the chat for the task with Newton.
To load an experiment chat log into Newton, add the following code to the notebook (note the file is named results.json, as described above):

import anachat
import json

with open("results.json", "r") as f:
    anachat.comm.COMM.history = json.load(f)

Then, click on the notebook name inside the Newton chat.
Note 1: the subfolder for P6 is exp6-e2c1-serverdied because the experiment server died before we were able to save the logs. We reconstructed them using the notebook newton_remake.ipynb based on the video recording.
Note 2: The sandboxing occurred during the development of Newton. We did not collect all the files, and the format of JSON files is different than the one supported by the attached version of Newton.
3. Responses.zip
The responses zip contains the following files:
demographics.csv: a CSV file containing the responses collected from participants using the demographics form
task_newton.csv: a CSV file containing the responses collected from participants using the post-task newton form.
task_control.csv: a CSV file containing the responses collected from participants using the post-task control form.
post_study.csv: a CSV file containing the responses collected from participants using the post-study questionnaire form.
4. Analysis.zip
The analysis zip contains the following files:
1.Challenge.ipynb: a Jupyter notebook file where the perceptions of challenges figure was created.
2.Interactions.py: a Python file where the participants’ JSON files were created.
3.Interactions.Graph.ipynb: a Jupyter notebook file where the participant’s interaction figure was created.
4.Interactions.Count.ipynb: a Jupyter notebook file that counts participants’ interaction with each figure.
config_interactions.py: this file contains the definitions of interaction colors and grouping
interactions.json: a JSON file with the interactions during the Newton task of each participant based on the categorization.
requirements.txt: dependencies required to run the code to generate the graphs and json analysis.
To run the analyses, install the dependencies on Python 3.10 with the following command, then execute the scripts and notebooks in order:
pip install -r requirements.txt
5. newton.zip
The newton zip contains the source code of the Jupyter Lab extension we used in the experiments. Read the README.md file inside it for instructions on how to install and run it.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the EHRI-3 project, we are investigating tools and methods that historical researchers and scholars can use to better understand, visualise, and interpret the material held by our partner archives. This dataset accompanies a tutorial exploring a technique called topic modelling in the context of a Holocaust-related historical collection.
We were on the lookout for datasets that would be easily accessible and, for convenience, predominantly in English. One such dataset was the United States Holocaust Memorial Museum’s (USHMM) extensive collection of oral history testimonies, for which there are a considerable number of textual transcripts. The museum’s total collection consists of 80,703 testimonies, 41,695 of which are available in English, with 2,894 of them listing a transcript.
Since there is not yet a ready-to-download dataset that includes these transcripts, we had to construct our own. Using a web scraping tool, we managed to create a list of the links pointing to the metadata (including transcripts) of the testimonies that were of interest to us. After obtaining the transcript and other metadata of each of these testimonies, we were able to create our dataset and curate it to remove any unwanted entries. For example, we made sure to remove entries with restrictions on access or use. We also removed entries with transcripts that consisted only of some automatically generated headers and entries which turned out to be in languages other than English. The remaining 1,873 transcripts form the corpus of this tutorial — a small, but still decently sized dataset.
The process that we followed to put together this dataset is detailed in the Jupyter Notebook accompanying this post, which can be found in this Github repository.
In this Zenodo upload, the user can find two files, each of them containing a pickled pandas DataFrame that was obtained at a different stage of the tutorial:
"unrestricted_df.pkl" contains 1,946 entries of Oral Testimony transcripts and has five fields (RG_number, text, display_date, conditions_access, conditions_use)
"unrestricted_lemmatized_df.pkl" contains 1,873 entries of Oral Testimony transcripts and has six fields (RG_number, text, display_date, conditions_access, conditions_use, lemmas)
Instructions on their intended use can be found in the accompanying Jupyter Notebook.
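As a hedged sketch of how such pickled DataFrames are typically loaded with pandas (the file and rows below are fabricated stand-ins with the same five fields, not the actual Zenodo files):

```python
import pandas as pd

# Stand-in DataFrame mimicking the schema of unrestricted_df.pkl;
# the row content is fabricated for illustration only.
df = pd.DataFrame({
    "RG_number": ["RG-50.030.0001"],
    "text": ["Example transcript text ..."],
    "display_date": ["1990"],
    "conditions_access": ["none"],
    "conditions_use": ["none"],
})
df.to_pickle("example_df.pkl")

# Loading works the same way for the files in this upload:
loaded = pd.read_pickle("example_df.pkl")
print(loaded.shape)  # (1, 5)
```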
Credits:
The transcripts that form the corpus in this tutorial were obtained through the United States Holocaust Memorial Museum (USHMM).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the artifacts of our study on how software engineering research papers are shared and interacted with on LinkedIn, a professional social network. This includes:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary dataset and Jupyter notebook for reproduction of the UM data within figures presented in McCulloch et al., 2022.
Data is a post-processed extract of the raw dataset for each variable. Data from the raw dataset has been extracted according to the appropriate Martian month, zonally meaned, and converted to a σ/pressure coordinate system. This process is the same as is applied to the MCD dataset, as can be seen in the Jupyter notebook.
The notebook provides the code needed to reproduce the figures with the given data. All instructions are detailed within the notebook, including package dependencies and configuration options. Due to licensing, we are only able to provide access to the UM post-processed data, for the MCD dataset please follow the instructions within the notebook.
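As a minimal sketch of the zonal-mean step described above (NumPy assumed; the array shape and values are fabricated, not the actual UM data):

```python
import numpy as np

# Fabricated temperature field on a (level, latitude, longitude) grid
rng = np.random.default_rng(42)
temp = rng.normal(200.0, 10.0, size=(5, 36, 72))

# Zonal mean: average over the longitude axis, leaving (level, latitude)
zonal_mean = temp.mean(axis=-1)
print(zonal_mean.shape)  # (5, 36)
```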
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This resource, configured for execution on connected JupyterHub compute platforms using the CyberGIS-Jupyter for Water (CJW) environment's supported High-Performance Computing (HPC) resources (Expanse or Virtual ROGER) through the CyberGIS-Compute Service, helps modelers reproduce and build on the results from the VB study (Van Beusekom et al., 2022), as explained by Maghami et al. (2023).
For this purpose, four Jupyter notebooks are developed and included in this resource. They explore the paper's goal for four example CAMELS sites and a pre-selected 60-month simulation period to demonstrate the capabilities of the notebooks.

The first notebook processes the raw input data from the CAMELS dataset to be used as input for the SUMMA model. The second notebook utilizes the CJW environment's supported HPC resources (Expanse or Virtual ROGER) through the CyberGIS-Compute Service to execute the SUMMA model; it uses the input data from the first notebook with original and altered forcing, as further described in the notebook. The third notebook utilizes the outputs from the second notebook and visualizes the sensitivity of the SUMMA model outputs using the Kling-Gupta Efficiency (KGE). The fourth notebook, developed only for the HPC environment (and currently working only with the Expanse HPC), enables transferring large data from HPC to the scientific cloud service (i.e., CJW) using the Globus service integrated by CyberGIS-Compute in a reliable, high-performance, and fast way.

More information about each Jupyter notebook and step-by-step instructions on how to run the notebooks can be found in the Readme.md file included in this resource. Using these four notebooks, modelers can apply the methodology mentioned above to any of the 671 CAMELS basins (from one to all) and simulation periods of their choice. As this resource uses HPC, it enables high-speed execution of simulations, which makes larger runs (even as large as the entire 671 CAMELS sites and the whole 60-month simulation period used in the paper) practical and much faster than when no HPC is used.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
In many blockchains, e.g., Ethereum and Binance Smart Chain (BSC), the primary representation used for wallet addresses is a hardly memorable 40-digit hexadecimal string. As a result, users often select addresses from their recent transaction history, which enables blockchain address poisoning. The adversary first generates lookalike addresses similar to one with which the victim has previously interacted, and then engages with the victim to “poison” their transaction history. The goal is to have the victim mistakenly send tokens to the lookalike address instead of the intended recipient. We develop a detection system and perform measurements over two years on Ethereum and BSC. We release the detection result dataset, including over 17 million attack attempts on Ethereum and the successful payoff transfers. We also provide a Jupyter notebook explaining 1) how to access the dataset, 2) how to produce descriptive statistics such as the number of poisoning transfers, and 3) how to manually verify the payoff transfers on Etherscan (BSCscan). This dataset will enable other researchers to validate our results as well as conduct further analysis.
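To make the "lookalike" notion concrete, here is an illustrative sketch of a simple prefix/suffix heuristic; this is not the detection system from the study, and the addresses below are fabricated:

```python
def lookalike(a: str, b: str, k: int = 4) -> bool:
    """Heuristic: two addresses look alike if their first and last
    k hex characters match (wallet UIs often display only these)."""
    a = a.lower().removeprefix("0x")
    b = b.lower().removeprefix("0x")
    return a[:k] == b[:k] and a[-k:] == b[-k:]

# Fabricated example addresses
legit  = "0xAb5801a7D398351b8bE11C439e05C5B3259aeC9B"
poison = "0xAb58F1c2D398351b8bE11C439e05C5B3259aeC9B"  # same head and tail
other  = "0x1234567890abcdef1234567890abcdef12345678"

print(lookalike(legit, poison))  # True
print(lookalike(legit, other))   # False
```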
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This is an example dataset recorded using version 1.0 of the open-source-hardware OpenAXES IMU. Please see the GitHub repository for more information on the hardware and firmware, and find the most up-to-date version of this document in the repository.
This dataset was recorded using four OpenAXES IMUs mounted on the segments of a robot arm (UR5 by Universal Robots). The robot arm was programmed to perform a calibration movement, then trace a 2D circle or triangle in the air with its tool center point (TCP), and return to its starting position, at four different speeds from 100 mm/s to 250 mm/s. This results in a total of 8 different scenarios (2 shapes times 4 speeds). The ground truth joint angle and TCP position values were obtained from the robot controller. The calibration movement at the beginning of the measurement allows for calculating the exact orientation of the sensors on the robot arm.
The IMUs were configured to send the raw data from the three gyroscope axes and the six accelerometer axes to a PC via BLE with 16 bit resolution per axis and 100 Hz sample rate. Since no data packets were lost during this process, this dataset allows comparing and tuning different sensor fusion algorithms on the recorded raw data while using the ground truth robot data as a reference.
In order to visualize the results, the quaternion sequences from the IMUs were applied to the individual segments of a 3D model of the robot arm. The end of this kinematic chain represents the TCP of the virtual model, which should ideally move along the same trajectory as the ground truth, barring the accuracy of the IMUs. Since the raw sensor data of these measurements is available, the calibration coefficients can also be applied ex-post.
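To give a feel for what tuning sensor fusion algorithms on the raw data involves, the following hedged sketch integrates fabricated 100 Hz gyroscope samples into a yaw angle; this naive dead-reckoning is the baseline any fusion filter improves on:

```python
dt = 1.0 / 100.0            # 100 Hz sample rate, as in the dataset
gyro_z = [10.0] * 100       # fabricated: constant 10 deg/s for one second

# Dead-reckoned yaw: simple rectangular integration of angular rate.
# In practice this drifts due to gyro bias, which is why fusion with
# accelerometer data (Madgwick, VQF, ...) is needed.
yaw = 0.0
for rate in gyro_z:
    yaw += rate * dt
print(yaw)  # ~10.0 degrees
```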
Since there are 6 joints but only 4 IMUs, some redundancy must be exploited. The redundancy comes from the fact that each IMU has 3 rotational degrees of freedom, but each joint has only one:

q0 and q1 are both derived from the orientation of the "humerus" IMU.
q2 is the difference† between the orientation of the "humerus" and "radius" IMUs.
q3 is the difference between the orientation of the "radius" and "carpus" IMUs.
q4 is the difference between the orientation of the "carpus" and "digitus" IMUs.
q5 does not influence the position of the TCP, only its orientation, so it is ignored in the evaluation.

† The difference is R1 * inv(R0) for two quaternions (or rotations) R0 and R1. The actual code works a bit differently, but this describes the general principle.

The dataset is organized as follows: the raw IMU measurements are in measure_raw-2022-09-15/, one folder per scenario. In those folders, there is one CSV file per IMU. The robot data is in measure_raw-2022-09-15/robot/, one CSV and MAT file per scenario. Videos are in Media and are stored in git lfs.

The file openaxes-example-robot-dataset.ipynb is provided to play around with the data in the dataset and demonstrate how the files are read and interpreted.
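The relative rotation R1 * inv(R0) can be sketched in pure Python; the (w, x, y, z) convention and the example angles here are assumptions for illustration, not the dataset's actual code:

```python
import math

def q_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2)

def q_inv(q):
    """Inverse of a unit quaternion (its conjugate)."""
    w, x, y, z = q
    return (w, -x, -y, -z)

# R0: 90 deg rotation about z, R1: 180 deg rotation about z
R0 = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
R1 = (0.0, 0.0, 0.0, 1.0)

# Their "difference" R1 * inv(R0) should be a 90 deg rotation about z
diff = q_mul(R1, q_inv(R0))
```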
To use the notebook, set up a Python 3 virtual environment and therein install the necessary packages with pip install -r requirements.txt.
In order to view the graphs contained in the ipynb file, you will most likely have to trust the notebook beforehand, using the following command:
jupyter trust openaxes-example-robot-dataset.ipynb
Beware: This notebook is not a comprehensive evaluation and any results and plots shown in the file are not necessarily scientifically sound evidence of anything.
The notebook will store intermediate files in the measure_raw-2022-09-15 directory, like the quaternion files calculated by the different filters, or the files containing the reconstructed TCP positions. All intermediate files should be ignored by the file measure_raw-2022-09-15/.gitignore. The generated intermediate files are also provided in the file measure_raw-2022-09-15.tar.bz2, in case you want to inspect the generated files without running the notebook.
A number of tools are used in the evaluation notebook. Below is a short overview, but not a complete specification. If you need to understand the input and output formats for each tool, please read the code.
calculate-quaternions.py is used in the evaluation notebook to compute different attitude estimation filters like Madgwick or VQF on the raw accelerometer and gyroscope measurements at 100 Hz.

madgwick-filter contains a small C program that applies the original Madgwick filter to a CSV file containing raw measurements and prints the results. It is used by calculate-quaternions.py.

calculate-robot-quaternions.py calculates a CSV file of quaternions equivalent to the IMU quaternions from a CSV file containing the joint angles of the robot.

dsense_vis, mentioned in the notebook, is used to calculate the 3D model of the robot arm from quaternions and determine the mounting orientations of the IMUs on the robot arm. This program will be released at a future date. In the meantime, the output files of dsense_vis are provided in the file measure_raw-2022-09-15.tar.bz2, which contains the complete content of the measure_raw-2022-09-15 directory after executing the whole notebook. Just unpack this archive and merge its contents with the measure_raw-2022-09-15 directory. This allows you to explore the reconstructed TCP files for the filters implemented at the time of publication.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data supports the results presented in the paper "A Comprehensive Study of the Differential Cross Sections for Water-Rare Gas Collisions: Experimental and Theoretical Perspectives". This research encompasses the analysis of the differential cross-section for the excitation of the fundamental ortho and para levels of water molecules by collision with Ne, Ar and Xe. A joint experimental and theoretical study has been undertaken to this end.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains measurements of radio-frequency electromagnetic emissions from a home-built sender module for BB84 quantum key distribution. The goal of these measurements was to evaluate information leakage through this side channel. This dataset supplements our publication and allows reproducing our results together with the source code hosted on GitHub (and also on Zenodo via integration with GitHub). The measurements were performed using a magnetic near-field probe, an amplifier, and an oscilloscope. The dataset contains raw measured data in the file format output by the oscilloscope; use our source code to make use of it. Detailed descriptions of the measurement procedure can be found in our paper and in the metadata JSON files found within the dataset.
Commented list of datasets
This file lists the datasets that were analyzed and reported on in the paper. The datasets in the list refer to directories here. Note that most of the datasets contain additional files with metadata, which detail where and how the measurements were performed. The mentioned Jupyter notebooks refer to the source code repository https://github.com/XQP-Munich/EmissionSecurityQKD (not included in this dataset). Most of those notebooks output JSON files storing results. The processed JSON files are also included in the source code repository.
In the naming of datasets:

Antenna refers to the log-periodic dipole antenna. All datasets that do not contain Antenna in their name are recorded with the magnetic near-field probe.
Rev1 refers to the initial electronics design, while Rev2 refers to the revised electronics design, which contains countermeasures aiming to reduce emissions.
Shielding refers to measurements where the device is enclosed in a metallic shielding and the measurement takes place outside the shielding.
Rotation refers to the orientation of the magnetic near-field probe at the same spatial location.
Datasets collected with near-field probe for Rev1 electronics
Rev1Distance: contains measurements at different distances from the Rev1 electronics, performed above the FPGA. The deep learning attack is analyzed in TEMPEST_ATTACK.ipynb. The amplitude is analyzed in get_raw_data_RMS_amplitude.ipynb.
Rev12D: different locations on a 2D grid at a constant distance from the electronics. The deep learning attack is analyzed in TEMPEST_ATTACK.ipynb.
Rev130meas2.5cm: 30 measurements above the FPGA at a height of 2.5 cm. Used to evaluate how the amount of training data affects neural network performance. The deep learning attack is analyzed in the notebooks TEMPEST_ATTACK*.ipynb. In particular, TEMPEST_ATTACK_VARY_TRAINING_DATA.ipynb is used on this dataset.
Rev1Rotation10deg: contains a measurement for varying orientation of the probe at the same location. This is not mentioned in the paper and is only included for completeness. The deep learning attack is analyzed in the notebooks TEMPEST_ATTACK*.ipynb.
Rev1TEMPESTShieldingFPGA: measurements with and without shielding at 4 cm above the FPGA. Analyzed in the notebooks TEMPEST_ATTACK*.ipynb.
Datasets collected with near-field probe for Rev2 electronics
Rev2Distance: contains measurements at different distances from the Rev2 electronics, performed above the FPGA.
Rev22D and Rev22Dstart_7_0: contain measurements on a 2D grid performed on the revised electronics. The dataset is split into two directories because the measurement procedure crashed in the middle. This split structure was kept in order to maintain consistency with the automatic metadata.
Rev230meas2.5cm: 30 measurements above the FPGA at a height of 2.5 cm. Used to evaluate how the amount of training data affects neural network performance. The deep learning attack is analyzed in the notebooks TEMPEST_ATTACK*.ipynb. In particular, TEMPEST_ATTACK_VARY_TRAINING_DATA.ipynb is used on this dataset.
Other datasets
BackgroundTuesday: background measurement (QKD device not powered at all) performed with the near-field probe on June 21st, 2022.
BackgroundSaturday: background measurement (QKD device not powered at all) performed with the near-field probe on June 11th, 2022.
AntennaSpectra: dataset of spectra directly recorded by the oscilloscope. Used to demonstrate the ability to tell apart, at a distance, the device sending a QKD key (standard operation) from the device being turned on but not sending any key. Analyzed in the notebook Comparing_KeyNokey_Measurements.ipynb.
Rev2ShieldingAntenna: raw amplitude measurements with the log-periodic dipole antenna on Rev2 electronics including the shielding enclosure, collected at various distances. None of our attacks against this scenario were successful. The dataset represents a challenge to test more advanced attacks using improved data processing.
T1DiabetesGranada
A longitudinal multi-modal dataset of type 1 diabetes mellitus
Documented by:
Rodriguez-Leon, C., Aviles-Perez, M. D., Banos, O., Quesada-Charneco, M., Lopez-Ibarra, P. J., Villalonga, C., & Munoz-Torres, M. (2023). T1DiabetesGranada: a longitudinal multi-modal dataset of type 1 diabetes mellitus. Scientific Data, 10(1), 916. https://doi.org/10.1038/s41597-023-02737-4
Background
Type 1 diabetes mellitus (T1D) patients face daily difficulties in keeping their blood glucose levels within appropriate ranges. Several techniques and devices, such as flash glucose meters, have been developed to help T1D patients improve their quality of life. Most recently, the data collected via these devices is being used to train advanced artificial intelligence models to characterize the evolution of the disease and support its management. The main problem for the generation of these models is the scarcity of data, as most published works use private or artificially generated datasets. For this reason, this work presents T1DiabetesGranada, an open (under specific permission) longitudinal dataset that not only provides continuous glucose levels, but also patient demographic and clinical information. The dataset includes 257,780 days of measurements over four years from 736 T1D patients from the province of Granada, Spain. This dataset progresses significantly beyond the state of the art as one of the longest and largest open datasets of continuous glucose measurements, thus boosting the development of new artificial intelligence models for glucose level characterization and prediction.
Data Records
The data are stored in four comma-separated values (CSV) files which are available in T1DiabetesGranada.zip. These files are described in detail below.
Patient_info.csv
Patient_info.csv is the file containing information about the patients, such as demographic data, start and end dates of blood glucose level measurements and biochemical parameters, number of biochemical parameters or number of diagnostics. This file is composed of 736 records, one for each patient in the dataset, and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Sex – Sex of the patient. Values: F (for female), M (for male).
Birth_year – Year of birth of the patient. Format: YYYY.
Initial_measurement_date – Date of the first blood glucose level measurement of the patient in the Glucose_measurements.csv file. Format: YYYY-MM-DD.
Final_measurement_date – Date of the last blood glucose level measurement of the patient in the Glucose_measurements.csv file. Format: YYYY-MM-DD.
Number_of_days_with_measures – Number of days with blood glucose level measurements of the patient, extracted from the Glucose_measurements.csv file. Values: ranging from 8 to 1463.
Number_of_measurements – Number of blood glucose level measurements of the patient, extracted from the Glucose_measurements.csv file. Values: ranging from 400 to 137292.
Initial_biochemical_parameters_date – Date of the first biochemical test to measure some biochemical parameter of the patient, extracted from the Biochemical_parameters.csv file. Format: YYYY-MM-DD.
Final_biochemical_parameters_date – Date of the last biochemical test to measure some biochemical parameter of the patient, extracted from the Biochemical_parameters.csv file. Format: YYYY-MM-DD.
Number_of_biochemical_parameters – Number of biochemical parameters measured on the patient, extracted from the Biochemical_parameters.csv file. Values: ranging from 4 to 846.
Number_of_diagnostics – Number of diagnoses realized to the patient, extracted from the Diagnostics.csv file. Values: ranging from 1 to 24.
Glucose_measurements.csv
Glucose_measurements.csv is the file containing the continuous blood glucose level measurements of the patients. The file is composed of more than 22.6 million records that constitute the time series of continuous blood glucose level measurements. It includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Measurement_date – Date of the blood glucose level measurement. Format: YYYY-MM-DD.
Measurement_time – Time of the blood glucose level measurement. Format: HH:MM:SS.
Measurement – Value of the blood glucose level measurement in mg/dL. Values: ranging from 40 to 500.
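A hedged sketch of reading this layout with pandas (the rows below are fabricated, not real patient data):

```python
import io
import pandas as pd

# Fabricated rows in the documented Glucose_measurements.csv layout
csv_text = (
    "Patient_ID,Measurement_date,Measurement_time,Measurement\n"
    "LIB190001,2020-05-01,08:00:00,110\n"
    "LIB190001,2020-05-01,08:15:00,123\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Combine date and time into a single timestamp for time-series work
df["timestamp"] = pd.to_datetime(df["Measurement_date"] + " " + df["Measurement_time"])
print(df["Measurement"].mean())  # 116.5
```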
Biochemical_parameters.csv
Biochemical_parameters.csv is the file containing data of the biochemical tests performed on patients to measure their biochemical parameters. This file is composed of 87482 records and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Reception_date – Date of receipt in the laboratory of the sample to measure the biochemical parameter. Format: YYYY-MM-DD.
Name – Name of the measured biochemical parameter. Values: 'Potassium', 'HDL cholesterol', 'Gammaglutamyl Transferase (GGT)', 'Creatinine', 'Glucose', 'Uric acid', 'Triglycerides', 'Alanine transaminase (GPT)', 'Chlorine', 'Thyrotropin (TSH)', 'Sodium', 'Glycated hemoglobin (Ac)', 'Total cholesterol', 'Albumin (urine)', 'Creatinine (urine)', 'Insulin', 'IA ANTIBODIES'.
Value – Value of the biochemical parameter. Values: ranging from -4.0 to 6446.74.
Diagnostics.csv
Diagnostics.csv is the file containing diagnoses of diabetes mellitus complications or other diseases that patients have in addition to type 1 diabetes mellitus. This file is composed of 1757 records and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Code – ICD-9-CM diagnosis code. Values: subset of 594 of the ICD-9-CM codes (https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes).
Description – ICD-9-CM long description. Values: subset of 594 of the ICD-9-CM long description (https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes).
Technical Validation
Blood glucose level measurements are collected using FreeStyle Libre devices, which are widely used for healthcare in patients with T1D. Abbott Diabetes Care, Inc., Alameda, CA, USA, the manufacturer company, has conducted validation studies of these devices concluding that the measurements made by their sensors compare to YSI analyzer devices (Xylem Inc.), the gold standard, yielding results of 99.9% of the time within zones A and B of the consensus error grid. In addition, other studies external to the company concluded that the accuracy of the measurements is adequate.
Moreover, it was also checked that, in most cases, the blood glucose level measurements per patient were continuous (i.e., a sample at least every 15 minutes) in the Glucose_measurements.csv file, as they should be.
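This continuity property can be checked, for example, as follows (pandas assumed; the timestamps are fabricated):

```python
import pandas as pd

# Fabricated timestamps for one patient; "continuous" here means
# consecutive samples are at most 15 minutes apart
ts = pd.Series(pd.to_datetime([
    "2020-05-01 08:00:00",
    "2020-05-01 08:15:00",
    "2020-05-01 08:30:00",
]))
gaps = ts.diff().dropna()
continuous = bool((gaps <= pd.Timedelta(minutes=15)).all())
print(continuous)  # True
```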
Usage Notes
For data downloading, it is necessary to be authenticated on the Zenodo platform, accept the Data Usage Agreement and send a request specifying full name, email, and the justification of the data use. This request will be processed by the Secretary of the Department of Computer Engineering, Automatics, and Robotics of the University of Granada and access to the dataset will be granted.
The files that compose the dataset are comma-delimited CSV files available in T1DiabetesGranada.zip. A Jupyter Notebook (Python v. 3.8) with code that may help to better understand the dataset, with graphics and statistics, is available in UsageNotes.zip.
Graphs_and_stats.ipynb
The Jupyter Notebook generates tables, graphs and statistics for a better understanding of the dataset. It has four main sections, one dedicated to each file in the dataset. In addition, it has useful functions such as calculating the patient age, deleting a patient list from a dataset file and leaving only a patient list in a dataset file.
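As a hedged sketch of the kind of helper functions mentioned (pandas assumed; the column names follow Patient_info.csv, the rows are fabricated, and the reference year is an assumption):

```python
import pandas as pd

# Fabricated Patient_info-style rows
info = pd.DataFrame({
    "Patient_ID": ["LIB190001", "LIB190002"],
    "Birth_year": [1985, 2001],
})

# Patient age relative to an assumed reference year
REFERENCE_YEAR = 2020
info["Age"] = REFERENCE_YEAR - info["Birth_year"]

# Keep only a given patient list (the inverse drops that list instead)
keep = ["LIB190002"]
subset = info[info["Patient_ID"].isin(keep)]
print(subset["Age"].tolist())  # [19]
```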
Code Availability
The dataset was generated using custom code located in CodeAvailability.zip. The code is provided as Jupyter Notebooks created with Python v. 3.8. The code was used to conduct tasks such as data curation, transformation, and variable extraction.
Original_patient_info_curation.ipynb
This Jupyter Notebook preprocesses the original file with patient data. Mainly, irrelevant rows and columns are removed, and the sex variable is recoded.
Glucose_measurements_curation.ipynb
This Jupyter Notebook preprocesses the original file with the continuous glucose level measurements of the patients. Principally, rows without information and duplicated rows are removed, and the variable with the timestamp is split into two new variables: measurement date and measurement time.
Biochemical_parameters_curation.ipynb
This Jupyter Notebook preprocesses the original file with data on the biochemical tests performed on patients to measure their biochemical parameters. Mainly, irrelevant rows and columns are removed, and the variable with the name of the measured biochemical parameter is translated.
Diagnostic_curation.ipynb
This Jupyter Notebook preprocesses the original file with data on the diagnoses of diabetes mellitus complications or other diseases that patients have in addition to T1D.
Get_patient_info_variables.ipynb
This Jupyter Notebook implements the feature extraction process from the files Glucose_measurements.csv, Biochemical_parameters.csv and Diagnostics.csv to complete the file Patient_info.csv. It is divided into six sections: the first three extract the features from each of the mentioned files, and the next three add the extracted features to the resulting new file.
Data Usage Agreement
The conditions for use are as follows:
You confirm that you will not attempt to re-identify research participants for any reason, including for re-identification theory research.
You commit to keeping the T1DiabetesGranada dataset confidential and secure and will not redistribute data or Zenodo account credentials.
You will require
This HydroShare resource provides Jupyter Notebooks with instructions and code for accessing and subsetting the NOAA Analysis of Record for Calibration (AORC) Dataset. There are two Jupyter Notebooks:
1. AORC_Point_Data_Retrieval.ipynb
2. AORC_Zone_Data_Retrieval.ipynb
The first retrieves data for a point in the area of the US covered, specified using geographic coordinates. The second retrieves data for areas specified via an uploaded polygon shapefile.
These notebooks programmatically retrieve the data from Amazon Web Services (https://registry.opendata.aws/noaa-nws-aorc/) and, in the case of shapefile-based retrieval, average the data over the shapes in the given shapefile.
The notebooks provided are coded to retrieve data from AORC version 1.1 released in ZARR format in December 2023.
The Analysis Of Record for Calibration (AORC) is a gridded record of near-surface weather conditions covering the continental United States and Alaska and their hydrologically contributing areas (https://registry.opendata.aws/noaa-nws-aorc/). It is defined on a latitude/longitude spatial grid with a mesh length of 30 arc seconds (~800 m), and a temporal resolution of one hour. Elements include hourly total precipitation, temperature, specific humidity, terrain-level pressure, downward longwave and shortwave radiation, and west-east and south-north wind components. It spans the period from 1979 across the Continental U.S. (CONUS) and from 1981 across Alaska, to the near-present (at all locations). This suite of eight variables is sufficient to drive most land-surface and hydrologic models and is used as input to the National Water Model (NWM) retrospective simulation. While the original NOAA process generated AORC data in netCDF format, the data has been post-processed to create a cloud optimized Zarr formatted equivalent that NOAA also disseminates.
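As a rough illustration of what point retrieval on such a grid involves, the sketch below maps a geographic coordinate to the nearest cell index on a 30-arc-second mesh. The grid origin used here is an assumption for illustration only; the actual notebooks select points from the Zarr store's own coordinate arrays (e.g. with xarray's nearest-neighbour selection).

```python
# Nearest-grid-cell lookup for a 30-arc-second (1/120 degree) mesh like AORC's.
# The origin below is an illustrative assumption, not the AORC coordinate layout.
CELL = 30.0 / 3600.0  # 30 arc seconds expressed in degrees

def nearest_cell_index(coord, origin):
    """Index of the grid cell whose centre is nearest to `coord` (degrees)."""
    return round((coord - origin) / CELL)

# Example: a point 0.25 degrees east of an assumed western grid edge
i = nearest_cell_index(-119.75, origin=-120.0)
```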
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Codes for single-cell RNA-sequencing analysis of data from mouse oviducts. WT and cKO data were collected from Pgrf/f (WT) or Wnt7aCre/+;Pgrf/f (cKO) mice at 0.5 days post coitus (dpc). Another dataset was collected from C57BL6/J adult mice that were ovariectomized and treated with progesterone (P4) at 1 mg/mouse for 2 or 24 hours, compared to vehicle control. The oviducts were dissected into infundibulum+ampulla (InfAmp or IA) and isthmus+UTJ (IsthUTJ or IU) regions.
This is a detailed description of the dataset, a data sheet for the dataset as proposed by Gebru et al.
Motivation for Dataset Creation Why was the dataset created? Embrapa ADD 256 (Apples by Drones Detection Dataset — 256 × 256) was created to provide images and annotation for research on apple detection in orchards for UAV-based monitoring in apple production.
What (other) tasks could the dataset be used for? Apple detection in low-resolution scenarios, similar to the aerial images employed here.
Who funded the creation of the dataset? The building of the ADD256 dataset was supported by the Embrapa SEG Project 01.14.09.001.05.04, Image-based metrology for Precision Agriculture and Phenotyping, and FAPESP under grant (2017/19282-7).
Dataset Composition What are the instances? Each instance consists of an RGB image and an annotation describing apples locations as circular markers (i.e., presenting center and radius).
How many instances of each type are there? The dataset consists of 1,139 images containing 2,471 apples.
What data does each instance consist of? Each instance contains an 8-bit RGB image. Its corresponding annotation is found in the JSON files: each apple marker is composed of its center (cx, cy) and its radius (in pixels), as seen below:
"gebler-003-06.jpg": [ { "cx": 116, "cy": 117, "r": 10 }, { "cx": 134, "cy": 113, "r": 10 }, { "cx": 221, "cy": 95, "r": 11 }, { "cx": 206, "cy": 61, "r": 11 }, { "cx": 92, "cy": 1, "r": 10 } ],
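For illustration, such an annotation entry can be read with the standard json module. The snippet below parses the example above (embedded as a string in place of an actual annotation file) and counts the apples per image.

```python
import json

# The example annotation entry from above, embedded as a JSON string
annotations = json.loads("""
{"gebler-003-06.jpg": [
    {"cx": 116, "cy": 117, "r": 10},
    {"cx": 134, "cy": 113, "r": 10},
    {"cx": 221, "cy": 95, "r": 11},
    {"cx": 206, "cy": 61, "r": 11},
    {"cx": 92, "cy": 1, "r": 10}
]}
""")

# One circular marker per apple, so counting markers counts apples
apples_per_image = {name: len(markers) for name, markers in annotations.items()}
```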
Dataset.ipynb is a Jupyter Notebook presenting a code example for reading the data as a PyTorch Dataset (it should be straightforward to adapt the code for other frameworks such as Keras/TensorFlow, fastai/PyTorch, Scikit-learn, etc.).
Is everything included or does the data rely on external resources? Everything is included in the dataset.
Are there recommended data splits or evaluation measures? The dataset comes with specified train/test splits. The splits are found in lists stored as JSON files.
| | Number of images | Number of annotated apples |
| --- | --- | --- |
| Training | 1,025 | 2,204 |
| Test | 114 | 267 |
| Total | 1,139 | 2,471 |
Dataset recommended split.
Standard measures from the information retrieval and computer vision literature should be employed: precision and recall, F1-score, and average precision, as seen in COCO and Pascal VOC.
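As a minimal sketch, once predicted markers have been matched to annotated apples, precision, recall and F1-score follow directly from the counts of true positives (TP), false positives (FP) and false negatives (FN). The counts used below are made up for illustration.

```python
# Detection metrics from matching counts: TP = correct detections,
# FP = spurious detections, FN = annotated apples that were missed.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: 220 of 267 annotated apples found, 30 spurious detections
p, r, f1 = precision_recall_f1(tp=220, fp=30, fn=47)
```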
What experiments were initially run on this dataset? The first experiments run on this dataset are described in A methodology for detection and location of fruits in apples orchards from aerial images by Santos & Gebler (2021).
Data Collection Process How was the data collected? The data employed in the development of the methodology came from two plots located at Embrapa's Temperate Climate Fruit Growing Experimental Station at Vacaria-RS (28°30’58.2”S, 50°52’52.2”W). Plants of the varieties Fuji and Gala are present in the dataset in equal proportions. The images were taken on December 13, 2018, by a UAV (DJI Phantom 4 Pro) that flew over the rows of the field at a height of 12 m. The images mix nadir and non-nadir views, allowing a more extensive view of the canopies. A subset of the images was randomly selected, and 256 × 256 pixel patches were extracted.
Who was involved in the data collection process? T. T. Santos and L. Gebler captured the images in the field. T. T. Santos performed the annotation.
How was the data associated with each instance acquired? The circular markers were annotated using the VGG Image Annotator (VIA).
WARNING: Finding non-ripe apples in low-resolution images of orchards is a challenging task even for humans. ADD256 was annotated by a single annotator, so users of this dataset should consider it a noisy dataset.
Data Preprocessing What preprocessing/cleaning was done? No preprocessing was applied.
Dataset Distribution How is the dataset distributed? The dataset is available at GitHub.
When will the dataset be released/first distributed? The dataset was released in October 2021.
What license (if any) is it distributed under? The data is released under Creative Commons BY-NC 4.0 (Attribution-NonCommercial 4.0 International license). There is a request to cite the corresponding paper if the dataset is used. For commercial use, contact Embrapa Agricultural Informatics business office.
Are there any fees or access/export restrictions? There are no fees or restrictions. For commercial use, contact Embrapa Agricultural Informatics business office.
Dataset Maintenance Who is supporting/hosting/maintaining the dataset? The dataset is hosted at Embrapa Agricultural Informatics and all comments or requests can be sent to Thiago T. Santos (maintainer).
Will the dataset be updated? There are no scheduled updates.
If others want to extend/augment/build on this dataset, is there a mechanism for them to do so? Contributors should contact the maintainer by e-mail.
No warranty The maintainers and their institutions are exempt from any liability, judicial or extrajudicial, for any losses or damages arising from the use of the data contained in the image database.
This dataset was collected by running the High-latitude Ionosphere Dynamics for Research Applications (HIDRA) model, which is a significant rewrite of the Ionosphere/Polar Wind Model (IPWM) [Varney et al., 2014; Varney et al., 2015; Varney et al., 2016] and is designed as a component of the Multiscale Atmosphere-Geospace Environment (MAGE) framework under development by the Center for Geospace Storms NASA DRIVE Science Center. This dataset was processed using Jupyter Notebook scripts for analysis and visualization.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The COVID_CO2_ferries dataset stems from variables of the THETIS-MRV and IHS Markit datasets. They are described in the manuscript referenced below in "Additional Resources".
Meaning of the variables in the COVID_CO2_ferries dataset:
| number | name | meaning | units |
| --- | --- | --- | --- |
| 1 | IMOn | IMO number (vessel unique identifier) | - |
| 2 | Eber | Per-ship CO2 emissions at berth | ton |
| 3 | Etot | Per-ship total CO2 emissions | ton |
| 4 | Dom | Sea basin (NOR, BAL, MED) | - |
| 5 | COVID | Dummy variable (true in 2020) | - |
| 6 | year | Year of CO2 emissions | - |
| 7 | Pme | Total power of main engines | 0/1 |
| 8 | LOA | Length over all | 0/1 |
| 9 | nPax | Passenger carrying capacity | 0/1 |
| 10 | yearB | Year of building | 0/1 |
| 11 | nCalls | Per-ship number of port calls | - |
| 12 | VType | Vessel type (see defining Eq. below) | - |
The variables with binary values (0/1 in the "units" column) refer to below (0) or above (1) the thresholds defined by:
| \(k\) | \(\varphi_k\) | \(\varphi_{k0}\) | units |
| --- | --- | --- | --- |
| 0 | Pme | 21,600 | kW |
| 1 | nPax | 1,250 | - |
| 2 | LOA | 174 | m |
| 3 | yearB | 1999 | - |
The VType variable is defined by:
\(\texttt{VType} = \sum_{k=0}^{3} 2^{k} \cdot H(\varphi_{k} - \varphi_{k0})\)
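A sketch of this encoding in Python: each exceeded threshold sets one bit of VType. Whether a value exactly at a threshold counts as above or below is an assumption here (it depends on the Heaviside convention adopted), and the example ship is hypothetical.

```python
# Threshold values phi_k0 from the table above, in bit order k = 0..3
THRESHOLDS = [("Pme", 21_600), ("nPax", 1_250), ("LOA", 174), ("yearB", 1999)]

def vtype(ship):
    """VType = sum over k of 2^k * H(phi_k - phi_k0); strict '>' assumed."""
    return sum(2 ** k * (1 if ship[name] > phi0 else 0)
               for k, (name, phi0) in enumerate(THRESHOLDS))

# Hypothetical ship: bits 0 (Pme), 2 (LOA) and 3 (yearB) are set
ship = {"Pme": 25_000, "nPax": 1_000, "LOA": 180, "yearB": 2005}
v = vtype(ship)  # 1 + 4 + 8 = 13
```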
Additional Resources
The manuscript "How COVID-19 affected GHG emissions of ferries in Europe" by Mannarini et al. (2022), which uses this dataset, was published in Sustainability 2022, 14(9), 5287; https://doi.org/10.3390/su14095287. You may want to cite it.
A Jupyter notebook using this dataset is available at https://github.com/hybrs/COVID-CO2-ferries
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).
The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.
Its novelty and our main contributions lie in (1) the focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research of multimodal approaches; (3) the mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.
The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).
The accompanying GitHub repository provides a small static sample of the dataset and the dataset's descriptive analysis in the form of Jupyter notebooks.
Options to access the dataset
There are two ways to get access to the dataset:
In order to obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.
References
If you use this dataset in any publication, project, tool or in any other form, please cite the following papers:
@inproceedings{SrbaMonantPlatform, author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria}, booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)}, pages = {1--7}, title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior}, year = {2019} }
@inproceedings{SrbaMonantMedicalDataset, author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria}, booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)}, numpages = {11}, title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims}, year = {2022}, doi = {10.1145/3477495.3531726}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3477495.3531726}, }
Dataset creation process
In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, Wordpress sites, Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.
Ethical considerations
The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.
The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.
As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.
Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.
Reporting mistakes in the dataset
The means to report considerable mistakes in raw collected data or in manual annotations is by creating a new issue in the accompanying GitHub repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.
Dataset structure
Raw data
First, the dataset contains so-called raw data (i.e., data extracted by the Web monitoring module of the Monant platform and stored in exactly the same form as they appear at the original websites). Raw data consist of articles from news sites and blogs (e.g., naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g., snopes.com). In addition, the dataset contains feedback (number of likes, shares, comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.
Raw data are contained in these CSV files (and corresponding REST API endpoints):
sources.csv
articles.csv
article_media.csv
article_authors.csv
discussion_posts.csv
discussion_post_authors.csv
fact_checking_articles.csv
fact_checking_article_media.csv
claims.csv
feedback_facebook.csv
Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.
Annotations
Second, the dataset contains so-called annotations. Entity annotations describe individual raw-data entities (e.g., article, source). Relation annotations describe relations between two such entities.
Each annotation is described by the following attributes:
category of annotation (annotation_category). Possible values: label (annotation corresponds to ground truth, determined by human experts) and prediction (annotation was created by means of an AI method).
type of annotation (annotation_type_id). Example values: Source reliability (binary), Claim presence. The list of possible values can be obtained from the enumeration in annotation_types.csv.
method which created the annotation (method_id). Example values: Expert-based source reliability evaluation, Fact-checking article to claim transformation method. The list of possible values can be obtained from the enumeration in methods.csv.
its value (value). The value is stored in JSON format and its structure differs according to the particular annotation type.
At the same time, annotations are associated with a particular object identified by:
entity type (parameter entity_type in case of entity annotations, or source_entity_type and target_entity_type in case of relation annotations). Possible values: sources, articles, fact-checking-articles.
entity id (parameter entity_id in case of entity annotations, or source_entity_id and target_entity_id in case of relation annotations).
The dataset provides specifically these entity annotations:
Source reliability (binary). Determines the validity of a source (website) on a binary scale with two options: reliable source and unreliable source.
Article veracity. Aggregated information about veracity from article-claim pairs.
The dataset provides specifically these relation annotations:
Fact-checking article to claim mapping. Determines mapping between fact-checking article and claim.
Claim presence. Determines presence of claim in article.
Claim stance. Determines stance of an article to a claim.
Annotations are contained in these CSV files (and corresponding REST API endpoints):
entity_annotations.csv
relation_annotations.csv
Note: Identification of human annotators (the email provided in the annotation app) is anonymised.
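As an illustrative sketch of working with these files, the JSON-encoded value column can be decoded per row with the standard csv and json modules. The column names and the sample row below are assumptions for illustration, not the dataset's exact schema.

```python
import csv
import io
import json

# Illustrative sample of an annotations CSV; column names and the row
# are assumptions, not the dataset's exact schema.
sample = io.StringIO(
    'entity_type,entity_id,annotation_type_id,annotation_category,value\n'
    'sources,42,source-reliability,label,"{""value"": ""reliable""}"\n'
)

rows = []
for row in csv.DictReader(sample):
    row["value"] = json.loads(row["value"])  # decode the JSON payload
    rows.append(row)
```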
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MCCN project aims to deliver tools that assist the agricultural sector in understanding crop-environment relationships, specifically by facilitating the generation of data cubes for spatiotemporal data. This repository contains Jupyter notebooks that demonstrate the functionality of the MCCN data cube components. The dataset contains input files for the case study (source_data), RO-Crate metadata (ro-crate-metadata.json), results from the case study (results), and a Jupyter Notebook (MCCN-CASE 3.ipynb).
Research Activity Identifier (RAiD): https://doi.org/10.26292/8679d473
Case Studies
This repository contains code and sample data for the following case studies. Note that the analyses here are to demonstrate the software; the results should not be considered scientifically or statistically meaningful. No effort has been made to address bias in samples, and sample data may not be available at sufficient density to warrant analysis. All case studies end with the generation of an RO-Crate data package including the source data, the notebook and generated outputs, including NetCDF exports of the data cubes themselves.
Case Study 3 - Select optimal survey locality
Given a set of existing survey locations across a variable landscape, determine the optimal site to add to increase the range of surveyed environments.
This study demonstrates: 1) loading heterogeneous data sources into a cube, and 2) analysis and visualisation using numpy and matplotlib.
Data Sources
The primary goal for this case study is to demonstrate importing a set of environmental values for different sites and then using these to identify a subset that maximises spread across the various environmental dimensions. This is a simple implementation that uses four environmental attributes imported for all of Australia (or a subset like NSW) at a moderate grid scale:
Digital soil maps for key soil properties over New South Wales, version 2.0 - SEED - see https://esoil.io/TERNLandscapes/Public/Pages/SLGA/ProductDetails-SoilAttributes.html
ANUCLIM Annual Mean Rainfall raster layer - SEED - see https://datasets.seed.nsw.gov.au/dataset/anuclim-annual-mean-rainfall-raster-layer
ANUCLIM Annual Mean Temperature raster layer - SEED - see https://datasets.seed.nsw.gov.au/dataset/anuclim-annual-mean-temperature-raster-layer
Dependencies
This notebook requires Python 3.10 or higher. Install the relevant Python libraries with: pip install mccn-engine rocrate. Installing mccn-engine will install the other dependencies.
Overview
1. Generate STAC metadata for layers from a predefined configuration
2. Load the data cube and exclude nodata values
3. Scale all variables to a 0.0-1.0 range
4. Select four layers for comparison (soil organic carbon 0-30 cm, soil pH 0-30 cm, mean annual rainfall, mean annual temperature)
5. Select 10 random points within NSW
6. Generate 10 new layers representing the standardised environmental distance between one of the selected points and all other points in NSW
7. For every point in NSW, find the lowest environmental distance to any of the selected points
8. Select the point in NSW that has the highest value for the lowest environmental distance to any selected point - this is the most different point
9. Clean up and save the results to an RO-Crate
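The maximin selection described in the overview can be sketched as follows, assuming sites are represented as tuples of environmental attributes already scaled to 0.0-1.0 (the values below are hypothetical):

```python
import math

def env_distance(a, b):
    """Standardised environmental (Euclidean) distance between two sites."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_different_site(candidates, selected):
    """Candidate whose *nearest* selected site is farthest away (maximin)."""
    return max(candidates,
               key=lambda c: min(env_distance(c, s) for s in selected))

# Hypothetical sites: (soil organic carbon, soil pH, rainfall, temperature), scaled 0-1
selected = [(0.1, 0.2, 0.3, 0.4), (0.9, 0.8, 0.7, 0.6)]
candidates = [(0.5, 0.5, 0.5, 0.5), (0.0, 0.0, 1.0, 1.0), (0.2, 0.2, 0.3, 0.5)]
best = most_different_site(candidates, selected)
```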
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the two semantically enriched trajectory datasets introduced in the CIKM Resource Paper "A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions", by Chiara Pugliese (CNR-IIT), Francesco Lettich (CNR-ISTI), Guido Rocchietti (CNR-ISTI), Chiara Renso (CNR-ISTI), and Fabio Pinelli (IMT Lucca, CNR-ISTI).
The two datasets were generated with an open source pipeline based on the Jupyter notebooks published in the GitHub repository behind our resource paper, and our MAT-Builder system. Overall, our pipeline first generates the files that we provide in the [paris|nyc]_input_matbuilder.zip archives; the files are then passed as input to the MAT-Builder system, which ultimately generates the two semantically enriched trajectory datasets for Paris and New York City, both in tabular and RDF formats. For more details on the input and output data, please see the sections below.
The [paris|nyc]_input_matbuilder.zip archives contain the data sources we used with the MAT-Builder system to semantically enrich raw preprocessed trajectories. More specifically, the archives contain the following files:
The [paris|nyc]_output_tabular.zip zip archives contain the output files generated by MAT-Builder that express the semantically enriched Paris and New York City datasets in tabular format. More specifically, they contain the following files:
There is then a second set of columns which represents the characteristics of the POI that has been associated with a stop. The relevant ones are:
- t2m: 2-meter temperature (2m_temperature, Celsius degrees)
- ssrd: Surface solar radiation (surface_solar_radiation_downwards, Watt per square meter)
- ssrdc: Surface solar radiation clear-sky (surface_solar_radiation_downward_clear_sky, Watt per square meter)
- ro: Runoff (runoff, millimeters)
There is also a set of derived variables:
- ws10: Wind speed at 10 meters (derived from 10m_u_component_of_wind and 10m_v_component_of_wind, meters per second)
- ws100: Wind speed at 100 meters (derived from 100m_u_component_of_wind and 100m_v_component_of_wind, meters per second)
- CS: Clear-Sky index (the ratio between the solar radiation and the solar radiation clear-sky)
- HDD/CDD: Heating/Cooling Degree Days (derived from 2-meter temperature following the EUROSTAT definition)
For each variable we have 367 440 hourly samples (from 01-01-1980 00:00:00 to 31-12-2021 23:00:00) for 34/115/309 regions (NUTS 0/1/2).
The data is provided in two formats:
- NetCDF version 4 (all the variables hourly and CDD/HDD daily). NOTE: the variables are stored as int16 type using a scale_factor to minimise the size of the files.
- Comma Separated Value ("single index" format for all the variables and time frequencies, and "stacked" only for daily and monthly). All the CSV files for each variable are stored in a zipped file.
Methodology
The time-series have been generated using the following workflow:
1. The NetCDF files are downloaded from the Copernicus Data Store from the "ERA5 hourly data on single levels from 1979 to present" dataset.
2. The data is read in R with the climate4r packages and aggregated using the function get_ts_from_shp from panas. All the variables are aggregated at the NUTS boundaries using the average, except for the runoff, which consists of the sum of all the grid points within the regional/national borders.
3. The derived variables (wind speed, CDD/HDD, clear-sky) are computed and all the CSV files are generated using R.
4. The NetCDF files are created using xarray in Python 3.8.
Example notebooks
In the folder notebooks on the associated GitHub repository there are two Jupyter notebooks which show how to deal effectively with the NetCDF data in xarray and how to visualise them in several ways using matplotlib or the enlopy package. There are currently two notebooks:
- exploring-ERA-NUTS: shows how to open the NetCDF files (with Dask) and how to manipulate and visualise them.
- ERA-NUTS-explore-with-widget: explores the datasets interactively with Jupyter and ipywidgets.
The notebook exploring-ERA-NUTS is also available rendered as HTML.
Additional files
In the folder additional files on the associated GitHub repository there is a map showing the spatial resolution of the ERA5 reanalysis and a CSV file specifying the number of grid points with respect to each NUTS0/1/2 region.
License
This dataset is released under the CC-BY-4.0 license.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Replication Package Files
1. Forms.zip: contains the forms used to collect data for the experiment
2. Experiments.zip: contains the participants’ and sandboxers’ experimental task workflow with Newton.
3. Responses.zip: contains the responses collected from participants during the experiments.
4. Analysis.zip: contains the data analysis scripts and results of the experiments.
5. newton.zip: contains the tool we used for the WoZ experiment.
TutorialStudy.pdf: the script used in the experiments with and without Newton, to keep the procedure consistent across all participants.
Woz_Script.pdf: the script used by the wizard to keep Newton's responses consistent across participants.
1. Forms.zip
The forms zip contains the following files:
Demographics.pdf: a PDF form used to collect demographic information from participants before the experiments.
Post-Task Control (without the tool).pdf: a PDF form used to collect data from participants about challenges and interactions when performing the task without Newton.
Post-Task Newton (with the tool).pdf: a PDF form used to collect data from participants after the task with Newton.
Post-Study Questionnaire.pdf: a PDF form used to collect data from the participant after the experiment.
2. Experiments.zip
The experiments zip contains two types of folders:
exp[participant’s number]-c[number of dataset used for control task]e[number of dataset used for experimental task]. Example: exp1-c2e1 (experiment participant 1 - control used dataset 2, experimental used dataset 1)
sandboxing[sandboxer’s number]. Example: sandboxing1 (experiment with sandboxer 1)
Every experiment subfolder contains:
warmup.json: a JSON file with the results of Newton-Participant interactions in the chat for the warmup task.
warmup.ipynb: a Jupyter notebook file with the participant’s results from the code provided by Newton in the warmup task.
sample1.csv: Death Event dataset.
sample2.csv: Heart Disease dataset.
tool.ipynb: a Jupyter notebook file with the participant’s results from the code provided by Newton in the experimental task.
python.ipynb: a Jupyter notebook file with the participant’s results from the code they tried during the control task.
results.json: a JSON file with the results of Newton-Participant interactions in the chat for the task with Newton.
To load an experiment chat log into Newton, add the following code to the notebook:
import anachat
import json

with open("results.json", "r") as f:
    anachat.comm.COMM.history = json.load(f)
Then, click on the notebook name inside the Newton chat.
Note 1: the subfolder for P6 is exp6-e2c1-serverdied because the experiment server died before we were able to save the logs. We reconstructed them using the notebook newton_remake.ipynb based on the video recording.
Note 2: The sandboxing occurred during the development of Newton. We did not collect all the files, and the format of the JSON files is different from the one supported by the attached version of Newton.
3. Responses.zip
The responses zip contains the following files:
demographics.csv: a CSV file containing the responses collected from participants using the demographics form
task_newton.csv: a CSV file containing the responses collected from participants using the post-task newton form.
task_control.csv: a CSV file containing the responses collected from participants using the post-task control form.
post_study.csv: a CSV file containing the responses collected from participants using the post-study questionnaire form.
4. Analysis.zip
The analysis zip contains the following files:
1.Challenge.ipynb: a Jupyter notebook file where the perceptions of challenges figure was created.
2.Interactions.py: a Python file where the participants’ JSON files were created.
3.Interactions.Graph.ipynb: a Jupyter notebook file where the participant’s interaction figure was created.
4.Interactions.Count.ipynb: a Jupyter notebook file that counts participants’ interactions with each figure.
config_interactions.py: this file contains the definitions of interaction colors and grouping.
interactions.json: a JSON file with the interactions during the Newton task of each participant based on the categorization.
requirements.txt: dependencies required to run the code to generate the graphs and json analysis.
To run the analyses, install the dependencies on Python 3.10 with the following command, then execute the scripts and notebooks in order:
pip install -r requirements.txt
5. newton.zip
The newton zip contains the source code of the Jupyter Lab extension we used in the experiments. Read the README.md file inside it for instructions on how to install and run it.