The INTEGRATE (Inverse Network Transformations for Efficient Generation of Robust Airfoil and Turbine Enhancements) project is developing a new inverse-design capability for the aerodynamic design of wind turbine rotors using invertible neural networks. This AI-based design technology can capture complex non-linear aerodynamic effects while being 100 times faster than design approaches based on computational fluid dynamics. This project enables innovation in wind turbine design by accelerating time to market through higher-accuracy early design iterations to reduce the levelized cost of energy.
INVERTIBLE NEURAL NETWORKS
Researchers are leveraging a specialized invertible neural network (INN) architecture, along with the novel dimension-reduction methods and airfoil/blade shape representations developed by collaborators at the National Institute of Standards and Technology (NIST), that learns complex relationships between airfoil or blade shapes and their associated aerodynamic and structural properties. This INN architecture will accelerate designs by providing a cost-effective alternative to current industrial aerodynamic design processes, including:
Blade element momentum (BEM) theory models: limited effectiveness for the design of offshore rotors with large, flexible blades where nonlinear aerodynamic effects dominate
Direct design using computational fluid dynamics (CFD): cost-prohibitive
Inverse-design models based on deep neural networks (DNNs): an attractive alternative to CFD for 2D design problems, but quickly overwhelmed by the increased number of design variables in 3D problems
AUTOMATED COMPUTATIONAL FLUID DYNAMICS FOR TRAINING DATA GENERATION - MERCURY FRAMEWORK
The INN is trained on data obtained using the University of Maryland's (UMD) Mercury Framework, which has robust automated mesh generation capabilities and advanced turbulence and transition models validated for wind energy applications. Mercury is a multi-mesh-paradigm, heterogeneous CPU-GPU framework. The framework incorporates three flow solvers at UMD: 1) OverTURNS, a structured solver on CPUs; 2) HAMSTR, a line-based unstructured solver on CPUs; and 3) GARFIELD, a structured solver on GPUs. The framework is based on Python, which is used to wrap the C and Fortran codes for interoperability with the other solvers. Communication between multiple solvers is accomplished with a Topology Independent Overset Grid Assembler (TIOGA).
NOVEL AIRFOIL SHAPE REPRESENTATIONS USING GRASSMANN SPACES
We developed a novel representation of shapes which decouples affine-style deformations from a rich set of data-driven deformations over a submanifold of the Grassmannian. The Grassmannian representation, as an analytic generative model informed by a database of physically relevant airfoils, offers (i) a rich set of novel 2D airfoil deformations not previously captured in the data, (ii) an improved low-dimensional parameter domain for inferential statistics informing design/manufacturing, and (iii) consistent 3D blade representation and perturbation over a sequence of nominal shapes.
TECHNOLOGY TRANSFER DEMONSTRATION - COUPLING WITH NREL WISDEM
Researchers have integrated the inverse-design tool for 2D airfoils (INN-Airfoil) into WISDEM (Wind Plant Integrated Systems Design and Engineering Model), a multidisciplinary design and optimization framework for assessing the cost of energy, as part of a tech-transfer demonstration.
The integration of INN-Airfoil into WISDEM allows for the design of airfoils along with the blades that meet the dynamic design constraints on cost of energy, annual energy production, and the capital costs. Through preliminary studies, researchers have shown that the coupled INN-Airfoil + WISDEM approach reduces the cost of energy by around 1% compared to the conventional design approach. This page will serve as a place to easily access all the publications from this work and the repositories for the software developed and released through this project.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a CSV file resulting from the linear transformation y = 3*x + 6 applied to 1000 randomly generated numbers between 0 and 100. The 1000 data points were generated with Python's random.randint() function.
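For reference, data of this kind can be regenerated with a short Python script (a sketch of the described procedure, not the original generator; the file and column names are illustrative):

import csv
import random

# Generate 1000 random integers in [0, 100] and apply y = 3*x + 6.
xs = [random.randint(0, 100) for _ in range(1000)]
ys = [3 * x + 6 for x in xs]

# Write the pairs to a CSV file (illustrative file and column names).
with open("linear_transformation.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["x", "y"])
    writer.writerows(zip(xs, ys))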
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Big data images for conduction heat transfer. The related paper has been published here: M. Edalatifar, M.B. Tavakoli, M. Ghalambaz, F. Setoudeh, Using deep learning to learn physics of conduction heat transfer, Journal of Thermal Analysis and Calorimetry; 2020. https://doi.org/10.1007/s10973-020-09875-6 Steps to reproduce: The dataset is saved in two formats, .npz for Python and .mat for MATLAB. The .mat file is large, so it is compressed with WinZip. ReadDataset_Python.py and ReadDataset_Matlab.m are examples of reading the data using Python and MATLAB, respectively. To use the dataset in MATLAB, download Dataset/HeatTransferPhenomena_35_58.zip, unzip it, and then use ReadDataset_Matlab.m as an example. For Python, download Dataset/HeatTransferPhenomena_35_58.npz and run ReadDataset_Python.py.
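For a first look at the Python archive, the arrays it contains can be listed before loading them (a minimal sketch; ReadDataset_Python.py in the dataset is the authoritative example, and the array names inside the archive may differ):

import numpy as np

# Open the .npz archive and inspect the arrays it contains.
data = np.load("Dataset/HeatTransferPhenomena_35_58.npz")
print(data.files)

for name in data.files:
    print(name, data[name].shape, data[name].dtype)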
The study 'Use of Deep Learning for structural analysis of CT-images of soil samples' used a set of soil sample data (CT images). All the data and programs used here are open source and were created with the help of open source software. All steps are performed by Python programs, which are included in the dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
The Free-living Food Intake Cycle (FreeFIC) dataset was created by the Multimedia Understanding Group towards the investigation of in-the-wild eating behavior. This is achieved by recording the subjects' meals as a small part of their everyday, unscripted activities. The FreeFIC dataset contains the 3D acceleration and orientation velocity signals (6 DoF) from 22 in-the-wild sessions provided by 12 unique subjects. All sessions were recorded using a commercial smartwatch (6 using the Huawei Watch 2™ and the MobVoi TicWatch™ for the rest) while the participants performed their everyday activities. In addition, FreeFIC also contains the start and end moments of each meal session as reported by the participants.
Description
FreeFIC includes 22 in-the-wild sessions that belong to 12 unique subjects. Participants were instructed to wear the smartwatch on the hand of their preference well ahead of any meal and continue to wear it throughout the day until the battery was depleted. In addition, we followed a self-report labeling model, meaning that the ground truth is provided by the participant documenting the start and end moments of their meals to the best of their abilities, as well as the hand on which they wear the smartwatch. The total duration of the 22 recordings sums up to 112.71 hours, with a mean duration of 5.12 hours. Additional data statistics can be obtained by executing the provided Python script stats_dataset.py. Furthermore, the accompanying Python script viz_dataset.py will visualize the IMU signals and ground truth intervals for each of the recordings. Information on how to execute the Python scripts can be found below.
$ python stats_dataset.py
$ python viz_dataset.py
FreeFIC is also tightly related to Food Intake Cycle (FIC), a dataset we created in order to investigate the in-meal eating behavior. More information about FIC can be found here and here.
Publications
If you plan to use the FreeFIC dataset or any of the resources found in this page, please cite our work:
@article{kyritsis2020data,
title={A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches},
author={Kyritsis, Konstantinos and Diou, Christos and Delopoulos, Anastasios},
journal={IEEE Journal of Biomedical and Health Informatics},
year={2020},
publisher={IEEE}}
@inproceedings{kyritsis2017automated,
title={Detecting Meals In the Wild Using the Inertial Data of a Typical Smartwatch},
author={Kyritsis, Konstantinos and Diou, Christos and Delopoulos, Anastasios},
booktitle={2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)},
year={2019},
organization={IEEE}}
Technical details
We provide the FreeFIC dataset as a pickle. The file can be loaded using Python in the following way:
import pickle as pkl
import numpy as np

with open('./FreeFIC_FreeFIC-heldout.pkl', 'rb') as fh:
    dataset = pkl.load(fh)
The dataset variable in the snippet above is a dictionary with 5 keys. Namely:
'subject_id'
'session_id'
'signals_raw'
'signals_proc'
'meal_gt'
The contents under a specific key can be obtained by:
sub = dataset['subject_id']     # for the subject id
ses = dataset['session_id']     # for the session id
raw = dataset['signals_raw']    # for the raw IMU signals
proc = dataset['signals_proc']  # for the processed IMU signals
gt = dataset['meal_gt']         # for the meal ground truth
The sub, ses, raw, proc and gt variables in the snippet above are lists with a length equal to 22. Elements across all lists are aligned; e.g., the 3rd element of the list under the 'session_id' key corresponds to the 3rd element of the list under the 'signals_proc' key.
sub: list. Each element of the sub list is a scalar (integer) that corresponds to the unique identifier of the subject, which can take the following values: [1, 2, 3, 4, 13, 14, 15, 16, 17, 18, 19, 20]. It should be emphasized that the subjects with ids 15, 16, 17, 18, 19 and 20 belong to the held-out part of the FreeFIC dataset (more information can be found in the publication titled "A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches" by Kyritsis et al.). Moreover, the subject identifier in FreeFIC is in line with the subject identifier in the FIC dataset (more info here and here); i.e., FIC's subject with id equal to 2 is the same person as FreeFIC's subject with id equal to 2.
ses: list. Each element of this list is a scalar (integer) that corresponds to the unique identifier of the session, which can range between 1 and 5. It should be noted that not all subjects have the same number of sessions.
raw: list. Each element of this list is a dictionary with the 'acc' and 'gyr' keys. The data under the 'acc' key is an N_acc x 4 numpy.ndarray that contains the timestamps in seconds (first column) and the 3D raw accelerometer measurements in g (second, third and fourth columns, representing the x, y and z axes, respectively). The data under the 'gyr' key is an N_gyr x 4 numpy.ndarray that contains the timestamps in seconds (first column) and the 3D raw gyroscope measurements in degrees/second (second, third and fourth columns, representing the x, y and z axes, respectively). All sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is on par with the signals in the FIC dataset (more info here and here). Finally, the lengths of the raw accelerometer and gyroscope numpy.ndarrays are different (N_acc ≠ N_gyr). This behavior is expected and is caused by the Android platform.
proc: list. Each element of this list is an M x 7 numpy.ndarray that contains the timestamps and the 3D accelerometer and gyroscope measurements for each meal. Specifically, the first column contains the timestamps in seconds, the second, third and fourth columns contain the x, y and z accelerometer values in g, and the fifth, sixth and seventh columns contain the x, y and z gyroscope values in degrees/second. Unlike elements in the raw list, processed measurements (in the proc list) have a constant sampling rate of 100 Hz and the accelerometer/gyroscope measurements are aligned with each other. In addition, all sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is on par with the signals in the FIC dataset (more info here and here). No other preprocessing is performed on the data; e.g., the acceleration component due to the Earth's gravitational field is present in the processed acceleration measurements. The interested researcher can consult the article "A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches" by Kyritsis et al. on how to further preprocess the IMU signals (i.e., smooth and remove the gravitational component).
meal_gt: list. Each element of this list is a K x 2 matrix. Each row represents a meal interval for the specific in-the-wild session. The first column contains the timestamps of the meal start moments, whereas the second one contains the timestamps of the meal end moments. All timestamps are in seconds. The number of meals K varies across recordings (e.g., recordings exist where a participant consumed two meals).
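As a quick illustration of this structure, the following sketch (based only on the descriptions above and the variables from the earlier snippet) prints the number of meals and total eating time per session:

# Summarize the meal ground truth per in-the-wild session.
for s, sess, meals in zip(sub, ses, gt):
    durations = [end - start for start, end in meals]  # seconds
    print(f"subject {s}, session {sess}: "
          f"{len(meals)} meal(s), {sum(durations) / 60:.1f} min of eating")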
Ethics and funding
Informed consent, including permission for third-party access to anonymised data, was obtained from all subjects prior to their engagement in the study. The work has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No 727688 - BigO: Big data against childhood obesity.
Contact
Any inquiries regarding the FreeFIC dataset should be addressed to:
Dr. Konstantinos KYRITSIS
Multimedia Understanding Group (MUG)
Department of Electrical & Computer Engineering
Aristotle University of Thessaloniki
University Campus, Building C, 3rd floor
Thessaloniki, Greece, GR54124
Tel: +30 2310 996359, 996365 Fax: +30 2310 996398 E-mail: kokirits [at] mug [dot] ee [dot] auth [dot] gr
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A simulated SEND + EDX dataset along with the code used to produce it.
Added in a newer version: the VAE processing of the SEND data has been included.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This package contains the datasets extracted from the Artleafs database, as well as the RO-Crate packages and the RDF dataset generated from them for the project to create RO-Crates using the PHCAT ontology. You can also find the Python scripts used to transform the extracted CSV data into a new RDF dataset, allowing you to create more RO-Crate packages if desired.
./data: contains the set of data extracted from the database in CSV format.
./resources: contains the generated RO-Crate packages as well as the mapping files used and the RDF subsets of each article.
./OutputPhotocatalysisMapping.ttl: the file in Turtle format that stores the global RDF dataset resulting from the translation of the database data.
The rest of the folders and files contain mapping rules and scripts used in the data transformation process. For more information check the following GitHub repository: https://github.com/oeg-upm/photocatalysis-ontology.
These data were used to examine grammatical structures and patterns within a set of geospatial glossary definitions. The objectives of our study were to analyze the semantic structure of input definitions, use this information to build triple structures of RDF graph data, upload our lexicon to a knowledge graph software, and perform SPARQL queries on the data. Upon completion of this study, SPARQL queries effectively retrieved graph triples that displayed semantic significance. These data represent and characterize the lexicon of our input text, which is used to form graph triples. These data were collected in 2024 by passing text through multiple Python programs utilizing spaCy (a natural language processing library) and its pre-trained English transformer pipeline. Before the data were processed by the Python programs, input definitions were first rewritten as natural language and formatted as tabular data. Passages were then tokenized and characterized by their part-of-speech, tag, dependency relation, dependency head, and lemma. Each word within the lexicon was tokenized. A stop-words list was utilized only to remove punctuation and symbols from the text, excluding hyphenated words (e.g., bowl-shaped), which remained as such. The tokens' lemmas were then aggregated and totaled to find their recurrences within the lexicon. This procedure was repeated for tokenizing noun chunks using the same glossary definitions.
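For illustration, the token attributes and lemma counts described above can be obtained with spaCy roughly as follows (a minimal sketch assuming the English transformer pipeline en_core_web_trf is installed; it is not the project's actual program, and the example definition is invented):

import spacy
from collections import Counter

# Any English pipeline works for illustration; the study used the pre-trained transformer pipeline.
nlp = spacy.load("en_core_web_trf")

definition = "A basin is a bowl-shaped depression in the land surface."  # example input
doc = nlp(definition)

# Token-level attributes used to characterize the lexicon.
for token in doc:
    if token.is_punct:  # the stop-words list only removes punctuation and symbols
        continue
    print(token.text, token.pos_, token.tag_, token.dep_, token.head.text, token.lemma_)

# Aggregate lemma recurrences.
lemma_counts = Counter(t.lemma_ for t in doc if not t.is_punct)
print(lemma_counts.most_common(5))

# Noun chunks from the same pipeline.
print([chunk.text for chunk in doc.noun_chunks])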
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This distribution provides the code and reference data for the manuscript of "Understanding the interplay between electrocatalytic C(sp3)‒C(sp3) fragmentation and oxygenation reactions".
The code can be executed using Python version 3.8.
Data
The distribution includes an .xlsx file with reference mass spectra data and experimental data in .tsv format.
Usage
Run example.ipynb. A pop-up window will prompt you to upload your experimental mass spectra data; select your file accordingly. The cells are organized as:
Load Data: This step loads the reference spectra data.
Preprocess Data: This step removes background signals and smooths the signal.
Optimization: This step uses constrained least squares optimization to reconstruct spectra and predict flux.
Plot: This step displays the spectra reconstruction and flux prediction.
File Output: This step saves the predicted flux and reconstructed spectra to files.
Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions on the itemsets a customer is most likely to purchase. I was given a dataset containing the data of a retailer; the transaction data covers all the transactions that have happened over a period of time. The retailer will use the results to grow in their industry and to provide customers with suggestions on itemsets, so we can increase customer engagement, improve customer experience, and identify customer behavior. I will solve this problem using Association Rules, a type of unsupervised learning technique that checks for the dependency of one data item on another data item.
Association Rules are most useful when you are planning to discover associations between different objects in a set, and they work well for finding frequent patterns in a transaction database. They can tell you which items customers frequently buy together, allowing the retailer to identify relationships between items.
Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":
- support = P(mouse & mat) = 8/100 = 0.08
- confidence = support / P(computer mouse) = 0.08 / 0.10 = 0.8
- lift = confidence / P(mouse mat) = 0.8 / 0.09 ≈ 8.9
This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
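The same arithmetic can be checked with a few lines of Python (a toy illustration of the example above, separate from the R workflow used later):

# Support, confidence and lift for "bought computer mouse => bought mouse mat".
n_customers = 100
n_mouse = 10   # bought a computer mouse
n_mat = 9      # bought a mouse mat
n_both = 8     # bought both

support = n_both / n_customers              # P(mouse & mat) = 0.08
confidence = n_both / n_mouse               # P(mat | mouse) = 0.80
lift = confidence / (n_mat / n_customers)   # 0.80 / 0.09 ≈ 8.9

print(support, confidence, round(lift, 2))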
Number of Attributes: 7
https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png
First, we need to load the required libraries. Below I briefly describe each library.
https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
Next, we need to upload Assignment-1_Data.xlsx to R to read the dataset. Now we can see our data in R.
https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png
Next we will clean our data frame and remove missing values.
https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png
To apply Association Rule mining, we need to convert the data frame into transaction data so that all items that are bought together in one invoice will be in ...
T1DiabetesGranada
A longitudinal multi-modal dataset of type 1 diabetes mellitus
Documented by:
Rodriguez-Leon, C., Aviles-Perez, M. D., Banos, O., Quesada-Charneco, M., Lopez-Ibarra, P. J., Villalonga, C., & Munoz-Torres, M. (2023). T1DiabetesGranada: a longitudinal multi-modal dataset of type 1 diabetes mellitus. Scientific Data, 10(1), 916. https://doi.org/10.1038/s41597-023-02737-4
Background
Type 1 diabetes mellitus (T1D) patients face daily difficulties in keeping their blood glucose levels within appropriate ranges. Several techniques and devices, such as flash glucose meters, have been developed to help T1D patients improve their quality of life. Most recently, the data collected via these devices is being used to train advanced artificial intelligence models to characterize the evolution of the disease and support its management. The main problem for the generation of these models is the scarcity of data, as most published works use private or artificially generated datasets. For this reason, this work presents T1DiabetesGranada, a longitudinal dataset, open under specific permission, that not only provides continuous glucose levels, but also patient demographic and clinical information. The dataset includes 257780 days of measurements over four years from 736 T1D patients from the province of Granada, Spain. This dataset progresses significantly beyond the state of the art as one of the longest and largest open datasets of continuous glucose measurements, thus boosting the development of new artificial intelligence models for glucose level characterization and prediction.
Data Records
The data are stored in four comma-separated values (CSV) files which are available in T1DiabetesGranada.zip. These files are described in detail below.
Patient_info.csv
Patient_info.csv is the file containing information about the patients, such as demographic data, start and end dates of blood glucose level measurements and biochemical parameters, number of biochemical parameters or number of diagnostics. This file is composed of 736 records, one for each patient in the dataset, and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Sex – Sex of the patient. Values: F (for female), M (for male).
Birth_year – Year of birth of the patient. Format: YYYY.
Initial_measurement_date – Date of the first blood glucose level measurement of the patient in the Glucose_measurements.csv file. Format: YYYY-MM-DD.
Final_measurement_date – Date of the last blood glucose level measurement of the patient in the Glucose_measurements.csv file. Format: YYYY-MM-DD.
Number_of_days_with_measures – Number of days with blood glucose level measurements of the patient, extracted from the Glucose_measurements.csv file. Values: ranging from 8 to 1463.
Number_of_measurements – Number of blood glucose level measurements of the patient, extracted from the Glucose_measurements.csv file. Values: ranging from 400 to 137292.
Initial_biochemical_parameters_date – Date of the first biochemical test to measure some biochemical parameter of the patient, extracted from the Biochemical_parameters.csv file. Format: YYYY-MM-DD.
Final_biochemical_parameters_date – Date of the last biochemical test to measure some biochemical parameter of the patient, extracted from the Biochemical_parameters.csv file. Format: YYYY-MM-DD.
Number_of_biochemical_parameters – Number of biochemical parameters measured on the patient, extracted from the Biochemical_parameters.csv file. Values: ranging from 4 to 846.
Number_of_diagnostics – Number of diagnoses made for the patient, extracted from the Diagnostics.csv file. Values: ranging from 1 to 24.
Glucose_measurements.csv
Glucose_measurements.csv is the file containing the continuous blood glucose level measurements of the patients. The file is composed of more than 22.6 million records that constitute the time series of continuous blood glucose level measurements. It includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Measurement_date – Date of the blood glucose level measurement. Format: YYYY-MM-DD.
Measurement_time – Time of the blood glucose level measurement. Format: HH:MM:SS.
Measurement – Value of the blood glucose level measurement in mg/dL. Values: ranging from 40 to 500.
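For orientation, the glucose time series can be loaded with pandas roughly as follows (a minimal sketch; it assumes Glucose_measurements.csv sits at the root of T1DiabetesGranada.zip and uses the column names described above):

import zipfile
import pandas as pd

# Read the glucose time series directly from the distributed zip file.
with zipfile.ZipFile("T1DiabetesGranada.zip") as z:
    with z.open("Glucose_measurements.csv") as f:
        glucose = pd.read_csv(f)

# Combine the date and time fields into a single timestamp.
glucose["Timestamp"] = pd.to_datetime(
    glucose["Measurement_date"] + " " + glucose["Measurement_time"]
)

# Mean glucose level (mg/dL) per patient.
print(glucose.groupby("Patient_ID")["Measurement"].mean().head())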
Biochemical_parameters.csv
Biochemical_parameters.csv is the file containing data of the biochemical tests performed on patients to measure their biochemical parameters. This file is composed of 87482 records and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Reception_date – Date of receipt in the laboratory of the sample to measure the biochemical parameter. Format: YYYY-MM-DD.
Name – Name of the measured biochemical parameter. Values: 'Potassium', 'HDL cholesterol', 'Gammaglutamyl Transferase (GGT)', 'Creatinine', 'Glucose', 'Uric acid', 'Triglycerides', 'Alanine transaminase (GPT)', 'Chlorine', 'Thyrotropin (TSH)', 'Sodium', 'Glycated hemoglobin (Ac)', 'Total cholesterol', 'Albumin (urine)', 'Creatinine (urine)', 'Insulin', 'IA ANTIBODIES'.
Value – Value of the biochemical parameter. Values: ranging from -4.0 to 6446.74.
Diagnostics.csv
Diagnostics.csv is the file containing diagnoses of diabetes mellitus complications or other diseases that patients have in addition to type 1 diabetes mellitus. This file is composed of 1757 records and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Code – ICD-9-CM diagnosis code. Values: subset of 594 of the ICD-9-CM codes (https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes).
Description – ICD-9-CM long description. Values: subset of 594 of the ICD-9-CM long description (https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes).
Technical Validation
Blood glucose level measurements are collected using FreeStyle Libre devices, which are widely used for healthcare in patients with T1D. Abbott Diabetes Care, Inc. (Alameda, CA, USA), the manufacturer, has conducted validation studies of these devices, concluding that the measurements made by their sensors compare well with those of YSI analyzer devices (Xylem Inc.), the gold standard, falling within zones A and B of the consensus error grid 99.9% of the time. In addition, other studies external to the company concluded that the accuracy of the measurements is adequate.
Moreover, it was also checked that, in most cases, the blood glucose level measurements per patient were continuous (i.e., at least one sample every 15 minutes) in the Glucose_measurements.csv file, as they should be.
Usage Notes
For data downloading, it is necessary to be authenticated on the Zenodo platform, accept the Data Usage Agreement and send a request specifying full name, email, and the justification of the data use. This request will be processed by the Secretary of the Department of Computer Engineering, Automatics, and Robotics of the University of Granada and access to the dataset will be granted.
The files that compose the dataset are comma-delimited CSV files and are available in T1DiabetesGranada.zip. A Jupyter Notebook (Python v. 3.8) with code that may help achieve a better understanding of the dataset, with graphics and statistics, is available in UsageNotes.zip.
Graphs_and_stats.ipynb
The Jupyter Notebook generates tables, graphs and statistics for a better understanding of the dataset. It has four main sections, one dedicated to each file in the dataset. In addition, it has useful functions, such as calculating patient ages, removing a list of patients from a dataset file, or keeping only a list of patients in a dataset file.
Code Availability
The dataset was generated using custom code located in CodeAvailability.zip. The code is provided as Jupyter Notebooks created with Python v. 3.8. The code was used to conduct tasks such as data curation and transformation, and variable extraction.
Original_patient_info_curation.ipynb
This Jupyter Notebook preprocesses the original file with patient data. Mainly, irrelevant rows and columns are removed, and the sex variable is recoded.
Glucose_measurements_curation.ipynb
This Jupyter Notebook preprocesses the original file with the continuous glucose level measurements of the patients. Principally, rows without information or duplicated rows are removed, and the variable with the timestamp is transformed into two new variables, measurement date and measurement time.
Biochemical_parameters_curation.ipynb
This Jupyter Notebook preprocesses the original file with the data of the biochemical tests performed on patients to measure their biochemical parameters. Mainly, irrelevant rows and columns are removed, and the variable with the name of the measured biochemical parameter is translated.
Diagnostic_curation.ipynb
This Jupyter Notebook preprocesses the original file with the data of the diagnoses of diabetes mellitus complications or other diseases that patients have in addition to T1D.
Get_patient_info_variables.ipynb
This Jupyter Notebook codes the feature extraction process from the files Glucose_measurements.csv, Biochemical_parameters.csv and Diagnostics.csv to complete the file Patient_info.csv. It is divided into six sections: the first three extract the features from each of the mentioned files, and the next three add the extracted features to the resulting new file.
Data Usage Agreement
The conditions for use are as follows:
You confirm that you will not attempt to re-identify research participants for any reason, including for re-identification theory research.
You commit to keeping the T1DiabetesGranada dataset confidential and secure and will not redistribute data or Zenodo account credentials.
You will require
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project focuses on data mapping, integration, and analysis to support the development and enhancement of six UNCDF operational applications: OrgTraveler, Comms Central, Internal Support Hub, Partnership 360, SmartHR, and TimeTrack. These apps streamline workflows for travel claims, internal support, partnership management, and time tracking within UNCDF.
Key Features and Tools:
Data Mapping for Salesforce CRM Migration: Structured and mapped data flows to ensure compatibility and seamless migration to Salesforce CRM.
Python for Data Cleaning and Transformation: Utilized pandas, numpy, and APIs to clean, preprocess, and transform raw datasets into standardized formats.
Power BI Dashboards: Designed interactive dashboards to visualize workflows and monitor performance metrics for decision-making.
Collaboration Across Platforms: Integrated Google Colab for code collaboration and Microsoft Excel for data validation and analysis.
💁♀️Please take a moment to carefully read through this description and metadata to better understand the dataset and its nuances before proceeding to the Suggestions and Discussions section.
This dataset provides a comprehensive collection of setlists from Taylor Swift’s official era tours, curated expertly by Spotify. The playlist, available on Spotify under the title "Taylor Swift The Eras Tour Official Setlist," encompasses a diverse range of songs that have been performed live during the tour events of this global artist. Each dataset entry corresponds to a song featured in the playlist.
Taylor Swift, a pivotal figure in both country and pop music scenes, has had a transformative impact on the music industry. Her tours are celebrated not just for their musical variety but also for their theatrical elements, narrative style, and the deep emotional connection they foster with fans worldwide. This dataset aims to provide fans and researchers an insight into the evolution of Swift's musical and performance style through her tours, capturing the essence of what makes her tour unique.
Obtaining the Data: The data was obtained directly from the Spotify Web API, specifically focusing on the setlist tracks by the artist. The Spotify API provides detailed information about tracks, artists, and albums through various endpoints.
Data Processing: To process and structure the data, Python scripts were developed using data science libraries such as pandas for data manipulation and spotipy for API interactions, specifically for Spotify data retrieval.
Workflow:
Authentication
API Requests
Data Cleaning and Transformation
Saving the Data
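For reference, the retrieval and saving steps can be sketched with spotipy as follows (the client credentials and playlist ID are placeholders, and the selected fields are only examples, not the dataset's full schema):

import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Placeholder credentials -- replace with your own app's values and the ID of
# the "Taylor Swift The Eras Tour Official Setlist" playlist.
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET"))

PLAYLIST_ID = "PLACEHOLDER_PLAYLIST_ID"

rows = []
results = sp.playlist_items(PLAYLIST_ID)
while results:
    for item in results["items"]:
        track = item["track"]
        rows.append({
            "name": track["name"],
            "album": track["album"]["name"],
            "popularity": track["popularity"],  # snapshot on the retrieval day
            "duration_ms": track["duration_ms"],
        })
    results = sp.next(results) if results["next"] else None

pd.DataFrame(rows).to_csv("eras_tour_setlist.csv", index=False)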
Note: The popularity score reflects the value recorded on the day this dataset was retrieved. The popularity score may fluctuate daily.
This dataset, derived from Spotify focusing on Taylor Swift's The Eras Tour setlist data, is intended for educational, research, and analysis purposes only. Users are urged to use this data responsibly, ethically, and within the bounds of legal stipulations.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data repository offers comprehensive resources, including datasets, Python scripts, and models associated with the study entitled, "Positive effects of public breeding on U.S. rice yields under future climate scenarios". The repository contains three models: a PCA model for data transformation, along with two meta-machine learning models for predictive analysis. Additionally, three Python scripts are available to facilitate the creation of training datasets and machine-learning models. The repository also provides tabulated weather, genetic, and county-level rice yield information specific to the southern U.S. region, which serves as the primary data inputs for our research. The focus of our study lies in modeling and predicting rice yields, incorporating factors such as molecular marker variation, varietal productivity, and climate, particularly within the Southern U.S. rice growing region. This region encompasses Arkansas, Louisiana, Texas, Mississippi, and Missouri, which collectively account for 85% of total U.S. rice production. By digitizing and merging county-level variety acreage data from 1970 to 2015 with genotyping-by-sequencing data, we estimate annual county-level allele frequencies. These frequencies, in conjunction with county-level weather and yield data, are employed to develop ten machine-learning models for yield prediction. An ensemble model, consisting of a two-layer meta-learner, combines the predictions of all ten models and undergoes external evaluation using historical Uniform Regional Rice Nursery trials (1980-2018) conducted within the same states. Lastly, the ensemble model, coupled with forecasted weather data from the Coupled Model Intercomparison Project, is employed to predict future production across the 110 rice-growing counties, considering various groups of germplasm.
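To make the modeling setup concrete, the overall structure (a PCA transformation feeding base learners whose predictions are combined by a meta-learner) can be sketched with scikit-learn as below; this is illustrative only and does not reproduce the study's ten base models or its two-layer meta-learner:

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Illustrative base learners; the study trained ten machine-learning models.
base_models = [
    ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
    ("ridge", Ridge(alpha=1.0)),
]

# PCA for data transformation, then a stacked ensemble with a meta-learner.
ensemble = make_pipeline(
    PCA(n_components=10),
    StackingRegressor(estimators=base_models, final_estimator=Ridge()),
)

# ensemble.fit(X_train, y_train)   # X: weather + allele-frequency features, y: county yield
# y_pred = ensemble.predict(X_test)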
This study was supported by USDA NIFA 2014-67003-21858 and USDA NIFA 2022-67013-36205.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The NumFocus dataset provides a comprehensive representation of contributor activity across 58 open-source projects supported by the NumFocus organization. Spanning a three-year observation period (January 2022 to December 2024), this dataset captures the dynamics of open-source collaboration within a defined community of scientific and data-driven software projects.
To address the challenges of interpreting raw GitHub event logs, the dataset introduces two structured levels of abstraction: actions and activities. Actions offer a detailed view of individual operations, such as creating branches or pushing commits, while activities aggregate related actions into high-level tasks, such as merging pull requests or resolving issues. This hierarchy bridges the gap between granular operations and contributors’ broader intentions.
The primary dataset focuses on activities, providing a high-level overview of contributor behavior. For users requiring more granular analysis, a complementary dataset of actions is also included.
The dataset is accompanied by a Python-based command-line tool that automates the transformation of raw GitHub event logs into structured actions and activities. The tool, along with its configurable mapping files and scripts, is publicly available at https://github.com/uhourri/ghmap.
The dataset is distributed across the following files:
Each action record captures a single contributor operation and includes the following attributes:
The dataset encompasses 24 distinct action types, each derived from specific GitHub events and representing a well-defined contributor operation:
Example of action record:
{
  "action": "CloseIssue",
  "event_id": "26170139709",
  "date": "2023-01-01T20:19:58Z",
  "actor": {
    "id": 1282691,
    "login": "KristofferC"
  },
  "repository": {
    "id": 1644196,
    "name": "JuliaLang/julia",
    "organisation": "JuliaLang",
    "organisation_id": 743164
  },
  "details": {
    "issue": {
      "id": 1515182791,
      "number": 48062,
      "title": "Bad default number of BLAS threads on 1.9?",
      "state": "closed",
      "author": {
        "id": 1282691,
        "login": "KristofferC"
      },
      "labels": [
        {
          "name": "linear algebra",
          "description": "Linear algebra"
        }
      ],
      "created_date": "2022-12-31T18:49:47Z",
      "updated_date": "2023-01-01T20:19:58Z",
      "closed_date": "2023-01-01T20:19:57Z"
    }
  }
}
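For readers who want to work with the action records programmatically, a minimal sketch is shown below; it assumes the actions are distributed as a JSON array of records shaped like the example above, and the filename actions.json is a placeholder:

import json
import pandas as pd

# Placeholder filename -- the dataset's actual actions file may be named differently.
with open("actions.json") as f:
    actions = json.load(f)  # list of records shaped like the example above

# Flatten the nested records into columns such as "actor.login" and "repository.name".
df = pd.json_normalize(actions)

# Count how often each contributor performs each action type.
counts = (df.groupby(["actor.login", "action"])
            .size()
            .unstack(fill_value=0))
print(counts.head())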
To provide a more meaningful abstraction, actions are grouped into activities. Activities represent cohesive, high-level tasks performed by contributors, such as merging a pull request, publishing a release, or resolving an issue. This higher-level grouping removes noise from low-level event logs and aligns with the contributor's intent.
Activities are constructed based on logical and temporal criteria. For example, merging a pull request may involve several distinct actions: closing the pull request, pushing the merged changes, and deleting the source branch. By aggregating these actions, the activity more accurately reflects the contributor’s intent.
Each activity record represents a cohesive, high-level task and includes the following attributes:
The dataset includes 21 distinct activity types, which aggregate related actions based on logical and temporal criteria to represent contributors’ high-level intent:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Each directory is named after the task_id field from the following query to the Materials Project database:
from pymatgen.ext.matproj import MPRester
with MPRester() as m:
q_res = m.query(
criteria={
"nelements": 1,
"e_above_hull": {"$lt": 0.1},
"nsites": {"$lt": 20},
"e_above_hull": {"$lt": 0.00001},
},
properties=["energy", "structure", "e_above_hull", "task_id", "exp"],
)
There are 117 directories in all. Each directory contains the POSCAR file of the unit cell, the CHGCAR files of the unit cell and of two different supercells:
sc1 = uc * [
[1, 1, 0],
[1, -1, 0],
[0, 0, 1],
]
sc2 = uc * [
[2, 0, 0],
[0, 2, 0],
[0, 0, 2],
]
and the output of the validation analysis, validate_sc.json, which should look like this:
{
"sc1": {
"1": 0.0002730357775883091,
"2": 6.913892646285771e-05,
"4": 1.7710165019594026e-05
},
"sc2": {
"1": 0.0002667279377434944,
"2": 6.911585183768033e-05,
"4": 2.6712034073784627e-05
},
"formula": "H2"
}
The output of the validation analysis is created using the validate_sc.py script, which calculates the average of the difference between the re-gridded and explicitly calculated charge densities.
The differences are stored in units of electrons/Angstrom^3 for each supercell and up-sampling factors 1/2/4.
Once the JSON files are in place, the plot from the paper can be generated using the plot.py script.
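For example, the per-directory JSON outputs can be collected into a single table (a minimal sketch, separate from the dataset's plot.py):

import json
from pathlib import Path
import pandas as pd

# Collect the re-gridding errors from every validate_sc.json into one table.
rows = []
for path in Path(".").glob("*/validate_sc.json"):
    with open(path) as f:
        result = json.load(f)
    for sc in ("sc1", "sc2"):
        for factor, diff in result[sc].items():
            rows.append({
                "task_id": path.parent.name,
                "formula": result["formula"],
                "supercell": sc,
                "upsampling": int(factor),
                "mean_diff_e_per_A3": diff,
            })

df = pd.DataFrame(rows)
print(df.groupby(["supercell", "upsampling"])["mean_diff_e_per_A3"].describe())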
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset for the manuscript entitled: Phylogenetic and genomic insights into the evolution of terpenoid biosynthesis genes in diverse plant lineages.
1 "Species name.xlsx": This Excel file includes species name abbreviations and their full names.
2 "Phylogeny" Folder: (Supplemental Figures S2–S25) This folder contains files related to the phylogenetic analyses for genes, including phylogenetic trees and corresponding amino acid sequences.
2.1 Folder Content:
Phylogenetic tree files (.raxml_bs.tre): These files are the phylogenetic trees of each gene, generated using RAxML.
Amino acid sequence files (.fa): These files contain the amino acid sequences used to construct the phylogenetic trees.
2.2 Steps to construct phylogenetic trees: you can use the following command in a Linux environment:
python2 fasta_to_tree.py ./input_dir/ 4 aa y # Place amino acid sequence files (*.fa) in the 'input_dir' directory.
The script 'fasta_to_tree.py' is available at: https://bitbucket.org/yanglab/phylogenomic_dataset_construction/src/master/scripts/
3 "Expression" Folder: (Figures 5A and 5B; Supplemental Figures S33 and S34) This folder is used for analyzing gene expression levels, including extracting, processing, and visualizing expression levels. In this study, gene expression levels are represented using Transcripts Per Million (TPM).
3.1 Folder Content:
'1total-TPM' and '2total-TPM': RNA-seq reads for each species were mapped to the CDSs of each species using Salmon v1.3.0. The sequence IDs and TPM values in the output files 'quant.sf' of each species were extracted and combined into 'total_TPM'. As the file size of 'total_TPM' exceeded the allowed size in Excel, we split it into '1total-TPM' and '2total-TPM'. Both files have two columns:
ID: A unique identifier for each gene.
TPM: The normalized expression value for the corresponding gene.
These files were used to extract the expression levels of target genes.
'input/' directory: This directory contains the gene ID files for which the expression levels need to be extracted. The gene IDs are derived from the amino acid sequence files in the "Phylogeny" folder. From these sequence files, the gene IDs for the target species are extracted. For example, you can use the following command in a Linux environment:
grep '>' DXR-MEP.fa > DXR.xlsx # 'DXR-MEP.fa' is the input file from the "Phylogeny" folder in this case; 'DXR.xlsx' is the output file, which contains the extracted gene IDs.
After extracting the gene IDs, make sure to add a column header named "ID" in the output file (DXR.xlsx).
Python Scripts (.py): These three scripts are executed in Visual Studio Code, using Python 3.10.4 as the runtime environment.
'gene_expression-average.py': The processed data generated by this script is used for downstream visualization tasks, such as creating heatmaps (Figure 5A) and raincloud plots (Figure 5B). It extracts the gene expression levels from '1total-TPM' and '2total-TPM' for the gene IDs in the 'input' directory and calculates the average expression level of terpenoid biosynthesis genes for each species; the results are stored in the summary file 'average-expression.xlsx'.
'gene_expression-3highest-average.py': The processed data generated by this script is used for downstream visualization tasks, such as creating heatmaps (Supplemental Figure S33A) and raincloud plots (Supplemental Figure S33B). It extracts the expression levels of the three highest-expressed terpenoid biosynthesis genes for each species and calculates the average of the top three values; the results are stored in the summary file '3highest-average-expression.xlsx'.
'gene_expression-sum.py': The processed data generated by this script is used for downstream visualization tasks, such as creating heatmaps (Supplemental Figure S34A) and raincloud plots (Supplemental Figure S34B). It extracts the gene expression levels from '1total-TPM' and '2total-TPM' for the gene IDs in the 'input' directory and calculates the sum of expression levels of terpenoid biosynthesis genes for each species; the results are stored in the summary file 'sum-expression.xlsx'.
CSV Files (.csv): These files are used as input for generating visualizations (e.g., heatmaps and raincloud plots).
'average-expression-log.csv': (Figure 5A and 5B) The file is generated by running the script 'gene_expression-average.py', which produces the file 'average-expression.xlsx'. The 'average-expression.xlsx' file is then converted to CSV format and further processed with a log2 transformation to create 'average-expression-log.csv'. This file contains processed data for the average expression levels of genes. All values are log2-transformed for better visualization and analysis.
'3highest-average-expression-log.csv': (Supplemental Figure S33A and S33B) The file is generated by running the script 'gene_expression-3highest-average.py', which produces the file '3highest-average-expression.xlsx'. The '3highest-average-expression.xlsx' file is then converted to CSV format and further processed with a log2 transformation to create '3highest-average-expression-log.csv'. This file contains the top three highest average expression levels for each gene family. All values are log2-transformed.
'sum-expression-log.csv': (Supplemental Figure S34A and S34B) The file is generated by running the script 'gene_expression-sum.py', which produces the file 'sum-expression.xlsx'. The 'sum-expression.xlsx' file is then converted to CSV format and further processed with a log2 transformation to create 'sum-expression-log.csv'. This file contains the summed expression levels for genes across selected species. All values are log2-transformed.
'heatmap.r': This file is used to generate heatmaps for visualizing gene expression levels. The raincloud plots are generated by tvBOT (https://www.chiplot.online/).
3.2 Workflow:
3.2.1. Preparation: Place the reference files ('1total-TPM' and '2total-TPM') and the gene ID files ('input/') in the respective directories.
3.2.2. Run Python Scripts: Use the following scripts based on the required analysis. Sum of expression levels: run 'gene_expression-sum.py'. Average expression levels: run 'gene_expression-average.py'. Top three highest averages: run 'gene_expression-3highest-average.py'. Each script will generate a summary file in .xlsx format.
3.2.3. Post-process the Outputs: Convert the summary file (.xlsx) into .csv format. Apply a log2 transformation to all expression values in the .csv files.
3.2.4. Generate heatmaps and raincloud plots: Use the R script (heatmap.r) to create heatmaps from the log2-transformed .csv files (Figure 5A; Supplemental Figures S33A and S34A). Use tvBOT (https://www.chiplot.online/) to create raincloud plots from the log2-transformed .csv files (Figure 5B; Supplemental Figures S33B and S34B).
4 "KaKs" Folder: (Figure 5C) This folder is used to process Ka/Ks data for individual genes and generate boxplots to visualize the Ka/Ks distribution.
4.1 Folder Content:
Python Scripts (.py):
'cdhit.py': This script is executed in a Linux environment. It is run directly with the command: python3 cdhit.py
This script is designed to remove redundant sequences using the CD-HIT tool. The input for this script is a FASTA file containing the sequences of terpenoid biosynthesis genes. These genes are identified using the gene IDs provided in the "Expression" folder. The gene IDs are used to retrieve the corresponding sequences, which are then organized into a FASTA file to serve as the input for CD-HIT. The non-redundant sequences obtained from this script are used for downstream Ka/Ks calculations.
'gene_pair.py': This script is executed in a Linux environment. It is run directly with the command: python3 gene_pair.py
This script generates all possible gene pairs (including reverse pairs, excluding self-pairs), such as 'DXR_pair.id', from a list of gene IDs provided in an input file. The gene IDs used as input for this script are obtained from the non-redundant sequences generated by 'cdhit.py'.
'boxplot.py': This script is executed in Visual Studio Code, using Python 3.10.4 as the runtime environment. It processes the data in the 'input/' directory and creates a boxplot of the Ka/Ks distribution (Figure 5C).
'DXR_pair.id': This file is an example of gene pairs, specifically for the DXR gene family in the target species. It is generated using 'gene_pair.py' and is a text file containing all possible gene pairs, formatted as:
Gene1 Gene2
Gene1 Gene3
Gene2 Gene1
...
'input/' directory: Contains Excel files with Ka/Ks ratios for each species. The Ka/Ks ratios are calculated using ParaAT and KaKs_Calculator for gene pairs. The resulting data is organized into an Excel file for each gene family, which contains two columns: 'Species' and 'Ka/Ks'. The Excel file is stored in the 'input/' directory.
4.2 Workflow:
4.2.1. Preparation: Use 'cdhit.py' to process gene sequences and generate non-redundant sequences. The genes used here are the same as those in the "Expression" folder. Extract gene IDs from the non-redundant sequences. Use 'gene_pair.py' to generate gene pairs from the extracted gene IDs. Calculate Ka/Ks values for the gene pairs using ParaAT and KaKs_Calculator, and then organize the results. Pre-process the output of KaKs_Calculator into .xlsx files with the following structure: Species: the species name; Ka/Ks: the calculated Ka/Ks ratios.
4.2.2. Run the Script: 'boxplot.py'
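For orientation, a boxplot of this kind can be produced roughly as follows (a minimal sketch assuming the 'Species'/'Ka/Ks' columns described above; it is not the dataset's boxplot.py):

import glob
import pandas as pd
import matplotlib.pyplot as plt

# Read every per-gene-family Excel file in input/ (columns: 'Species', 'Ka/Ks').
frames = [pd.read_excel(path) for path in glob.glob("input/*.xlsx")]
kaks = pd.concat(frames, ignore_index=True)

# One box per species.
labels = sorted(kaks["Species"].unique())
groups = [kaks.loc[kaks["Species"] == s, "Ka/Ks"].dropna() for s in labels]

plt.boxplot(groups, labels=labels)
plt.ylabel("Ka/Ks")
plt.xticks(rotation=90)
plt.tight_layout()
plt.savefig("kaks_boxplot.png", dpi=300)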
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ONE DATA data science workflow dataset ODDS-full comprises 815 unique workflows in temporally ordered versions.
A version of a workflow describes its evolution over time, so whenever a workflow is altered meaningfully, a new version of this respective workflow is persisted.
Overall, 16035 versions are available.
The ODDS-full workflows represent machine learning workflows expressed as node-heterogeneous DAGs with 156 different node types.
These node types represent various kinds of processing steps of a general machine learning workflow and are grouped into 5 categories, which are listed below.
Any metadata beyond the structure and node types of a workflow has been removed for anonymization purposes.
ODDS, a filtered variant which enforces weak connectedness and only contains workflows with at least 5 different versions and 5 nodes, is available as the default version for supervised and unsupervised learning.
Workflows are served as JSON node-link graphs via networkx.
They can be loaded into python as follows:
import pandas as pd
import networkx as nx
import json
with open('ODDS.json', 'r') as f:
graphs = pd.Series(list(map(nx.node_link_graph, json.load(f)['graphs'])))
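Once loaded, the graphs can be inspected directly with networkx; for example (the node attribute holding the node type is assumed here to be called "type", which may differ in the actual files):

from collections import Counter

# Number of workflow versions and the distribution of node types across them.
print(f"{len(graphs)} workflow versions loaded")

node_types = Counter(
    data.get("type")  # assumed attribute name for the node type
    for g in graphs
    for _, data in g.nodes(data=True)
)
print(node_types.most_common(10))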
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
📝 Dataset Overview: This dataset represents real-world, enhanced transactional data from Timac Global Concept, one of Nigeria’s prominent players in fuel and petroleum distribution. It includes comprehensive sales records across multiple stations and product categories (AGO, PMS, Diesel, Lubricants, LPG), along with revenue and shift-based operational tracking.
The dataset is ideal for analysts, BI professionals, and data science students aiming to explore fuel economy trends, pricing dynamics, and operational analytics.
🔍 Dataset Features:
Date – Transaction date
Station_Name – Name of the fuel station
AGO_Sales (L) – Automotive Gas Oil sold in liters
PMS_Sales (L) – Premium Motor Spirit sold in liters
Lubricant_Sales (L) – Lubricant sales in liters
Diesel_Sales (L) – Diesel sold in liters
LPG_Sales (kg) – Liquefied Petroleum Gas sold in kilograms
Total_Revenue (₦) – Total revenue generated in Nigerian Naira
AGO_Price – Price per liter of AGO
PMS_Price – Price per liter of PMS
Lubricant_Price – Unit price of lubricants
Diesel_Price – Price per liter of diesel
LPG_Price – Price per kg of LPG
Product_Category – Fuel product type
Shift – Work shift (e.g., Morning, Night)
Supervisor – Supervisor in charge during shift
Weekday – Day of the week for each transaction
🎯 Use Cases: Build Power BI dashboards to track fuel sales trends and shifts
Perform revenue forecasting using time series models
Analyze price dynamics vs sales volume
Visualize station-wise performance and weekday sales patterns
Conduct operational audits per supervisor or shift
🧰 Best Tools for Analysis: Power BI, Tableau
Python (Pandas, Matplotlib, Plotly)
Excel for pivot tables and summaries
SQL for fuel category insights
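For example, a quick pandas aggregation along these lines (the CSV filename is a placeholder for the actual Kaggle file) gives station-level and weekday revenue summaries:

import pandas as pd

# Placeholder filename -- use the actual CSV name from the Kaggle dataset.
sales = pd.read_csv("timac_fuel_sales.csv", parse_dates=["Date"])

# Total revenue by station and by weekday, using the columns described above.
by_station = sales.groupby("Station_Name")["Total_Revenue (₦)"].sum().sort_values(ascending=False)
by_weekday = sales.groupby("Weekday")["Total_Revenue (₦)"].sum()

print(by_station.head())
print(by_weekday)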
👤 Created By: Fatolu Peter (Emperor Analytics) Data analyst focused on real-life data transformation in Nigeria’s petroleum, healthcare, and retail sectors. This is Project 11 in my growing portfolio of end-to-end analytics challenges.
✅ LinkedIn Post: ⛽ New Dataset Alert – Fuel Economy & Sales Data Now on Kaggle! 📊 Timac Fuel Distribution & Revenue Dataset (Nigeria – 500 Records) 🔗 Explore the data here
Looking to practice business analytics, revenue forecasting, or operational dashboards?
This dataset contains:
Daily sales of AGO, PMS, Diesel, LPG & Lubricants
Revenue breakdowns by station
Shift & supervisor tracking
Fuel prices across product categories
You can use this to: ✅ Build Power BI sales dashboards ✅ Create fuel trend visualizations ✅ Analyze shift-level profitability ✅ Forecast revenue using Python or Excel
Let’s put real Nigerian data to real analytical work. Tag me when you build with it—I’d love to celebrate your work!