26 datasets found
  1. Data from: INTEGRATE - Inverse Network Transformations for Efficient...

    • catalog.data.gov
    • data.openei.org
    • +3 more
    Updated Jun 11, 2023
    Cite
    National Renewable Energy Laboratory (NREL) (2023). INTEGRATE - Inverse Network Transformations for Efficient Generation of Robust Airfoil and Turbine Enhancements [Dataset]. https://catalog.data.gov/dataset/integrate-inverse-network-transformations-for-efficient-generation-of-robust-airfoil-and-t
    Explore at:
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    National Renewable Energy Laboratory (NREL)
    Description

    The INTEGRATE (Inverse Network Transformations for Efficient Generation of Robust Airfoil and Turbine Enhancements) project is developing a new inverse-design capability for the aerodynamic design of wind turbine rotors using invertible neural networks. This AI-based design technology can capture complex non-linear aerodynamic effects while being 100 times faster than design approaches based on computational fluid dynamics. The project enables innovation in wind turbine design by accelerating time to market through higher-accuracy early design iterations, reducing the levelized cost of energy.

    INVERTIBLE NEURAL NETWORKS

    Researchers are leveraging a specialized invertible neural network (INN) architecture, along with the novel dimension-reduction methods and airfoil/blade shape representations developed by collaborators at the National Institute of Standards and Technology (NIST), that learns complex relationships between airfoil or blade shapes and their associated aerodynamic and structural properties. This INN architecture will accelerate design by providing a cost-effective alternative to current industrial aerodynamic design processes, including:

    • Blade element momentum (BEM) theory models: limited effectiveness for the design of offshore rotors with large, flexible blades where nonlinear aerodynamic effects dominate
    • Direct design using computational fluid dynamics (CFD): cost-prohibitive
    • Inverse-design models based on deep neural networks (DNNs): an attractive alternative to CFD for 2D design problems, but quickly overwhelmed by the increased number of design variables in 3D problems

    AUTOMATED COMPUTATIONAL FLUID DYNAMICS FOR TRAINING DATA GENERATION - MERCURY FRAMEWORK

    The INN is trained on data obtained using the University of Maryland's (UMD) Mercury Framework, which has robust automated mesh generation capabilities and advanced turbulence and transition models validated for wind energy applications. Mercury is a multi-mesh-paradigm, heterogeneous CPU-GPU framework. It incorporates three flow solvers at UMD: 1) OverTURNS, a structured solver on CPUs; 2) HAMSTR, a line-based unstructured solver on CPUs; and 3) GARFIELD, a structured solver on GPUs. The framework is based on Python, which is used to wrap the C and Fortran codes for interoperability with other solvers. Communication between multiple solvers is accomplished with the Topology Independent Overset Grid Assembler (TIOGA).

    NOVEL AIRFOIL SHAPE REPRESENTATIONS USING GRASSMANN SPACES

    We developed a novel representation of shapes which decouples affine-style deformations from a rich set of data-driven deformations over a submanifold of the Grassmannian. The Grassmannian representation, as an analytic generative model informed by a database of physically relevant airfoils, offers (i) a rich set of novel 2D airfoil deformations not previously captured in the data, (ii) an improved low-dimensional parameter domain for inferential statistics informing design/manufacturing, and (iii) consistent 3D blade representation and perturbation over a sequence of nominal shapes.

    TECHNOLOGY TRANSFER DEMONSTRATION - COUPLING WITH NREL WISDEM

    Researchers have integrated the inverse-design tool for 2D airfoils (INN-Airfoil) into WISDEM (Wind Plant Integrated Systems Design and Engineering Model), a multidisciplinary design and optimization framework for assessing the cost of energy, as part of a technology-transfer demonstration. The integration of INN-Airfoil into WISDEM allows airfoils to be designed along with the blades, subject to design constraints on cost of energy, annual energy production, and capital costs. Through preliminary studies, researchers have shown that the coupled INN-Airfoil + WISDEM approach reduces the cost of energy by around 1% compared to the conventional design approach. This page will serve as a place to easily access all the publications from this work and the repositories for the software developed and released through this project.

  2. Linear Transformation

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Sep 23, 2020
    Cite
    Gao Ruohan (2020). Linear Transformation [Dataset]. http://doi.org/10.6084/m9.figshare.12992972.v1
    Explore at:
    Available download formats: csv
    Dataset updated
    Sep 23, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gao Ruohan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a CSV file resulting from the linear transformation y = 3*x + 6 applied to 1000 randomly generated numbers between 0 and 100. It was generated by applying the linear transformation to 1000 data points produced with the random.randint() function.
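
    A minimal sketch of how such a file could be reproduced in Python (the column names and the inclusive 0-100 range are assumptions, not taken from the dataset itself):

    import csv
    import random

    rows = []
    for _ in range(1000):
        x = random.randint(0, 100)   # random integer between 0 and 100 (assumed inclusive)
        y = 3 * x + 6                # the linear transformation y = 3*x + 6
        rows.append((x, y))

    with open("linear_transformation.csv", "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["x", "y"])  # assumed header
        writer.writerows(rows)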

  3. A dataset for conduction heat transfer and deep learning

    • data.mendeley.com
    Updated Jun 25, 2020
    + more versions
    Cite
    Mohammad Edalatifar (2020). A dataset for conduction heat transfer and deep learning [Dataset]. http://doi.org/10.17632/rw9yk3c559.1
    Explore at:
    Dataset updated
    Jun 25, 2020
    Authors
    Mohammad Edalatifar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Big data images for conduction heat transfer. The related paper has been published here: M. Edalatifar, M.B. Tavakoli, M. Ghalambaz, F. Setoudeh, Using deep learning to learn physics of conduction heat transfer, Journal of Thermal Analysis and Calorimetry, 2020. https://doi.org/10.1007/s10973-020-09875-6 Steps to reproduce: The dataset is saved in two formats, .npz for Python and .mat for MATLAB. The .mat file is large, so it is compressed with WinZip. ReadDataset_Python.py and ReadDataset_Matlab.m are examples of reading the data with Python and MATLAB, respectively. To use the dataset in MATLAB, download Dataset/HeatTransferPhenomena_35_58.zip, unzip it, and then use ReadDataset_Matlab.m as an example. For Python, download Dataset/HeatTransferPhenomena_35_58.npz and run ReadDataset_Python.py.
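
    A minimal sketch of inspecting the .npz archive in Python (the array names inside the archive are not documented here, so they are listed rather than assumed; the provided ReadDataset_Python.py remains the authoritative example):

    import numpy as np

    # Load the compressed NumPy archive downloaded from the dataset
    data = np.load("HeatTransferPhenomena_35_58.npz")

    # List the arrays stored in the archive with their shapes and dtypes
    for name in data.files:
        print(name, data[name].shape, data[name].dtype)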

  4. Soil images in DICOM format including Python programs for data...

    • search.dataone.org
    • datadryad.org
    Updated Apr 24, 2025
    Cite
    Ralf Wieland (2025). Soil images in DICOM format including Python programs for data transformation, 3D analysis, CNN training, CNN analysis [Dataset]. http://doi.org/10.5061/dryad.66t1g1k0c
    Explore at:
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Ralf Wieland
    Time period covered
    Jan 1, 2020
    Description

    The study 'Use of Deep Learning for structural analysis of CT-images of soil samples' used a set of soil sample data (CT images). All the data and programs used here are open source and were created with the help of open-source software. All processing steps are performed by Python programs, which are included in the data set.
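
    A minimal sketch of reading one of the DICOM images in Python (the file name is a placeholder and pydicom is an assumption about tooling; the dataset's own transformation programs are included in the download):

    import numpy as np
    import pydicom

    # Read a single DICOM slice and access its pixel data
    ds = pydicom.dcmread("soil_sample_slice.dcm")   # placeholder file name
    image = ds.pixel_array.astype(np.float32)

    print("Slice shape:", image.shape)
    print("Value range:", image.min(), "-", image.max())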

  5. Wrist-mounted IMU data towards the investigation of free-living human eating...

    • data.niaid.nih.gov
    Updated Jun 20, 2022
    Cite
    Kyritsis, Konstantinos (2022). Wrist-mounted IMU data towards the investigation of free-living human eating behavior - the Free-living Food Intake Cycle (FreeFIC) dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4420038
    Explore at:
    Dataset updated
    Jun 20, 2022
    Dataset provided by
    Diou, Christos
    Delopoulos, Anastasios
    Kyritsis, Konstantinos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    The Free-living Food Intake Cycle (FreeFIC) dataset was created by the Multimedia Understanding Group towards the investigation of in-the-wild eating behavior. This is achieved by recording the subjects' meals as a small part of their everyday, unscripted activities. The FreeFIC dataset contains the 3D acceleration and orientation velocity signals (6 DoF) from 22 in-the-wild sessions provided by 12 unique subjects. All sessions were recorded using a commercial smartwatch (6 using the Huawei Watch 2™ and the MobVoi TicWatch™ for the rest) while the participants performed their everyday activities. In addition, FreeFIC also contains the start and end moments of each meal session as reported by the participants.

    Description

    FreeFIC includes 22 in-the-wild sessions that belong to 12 unique subjects. Participants were instructed to wear the smartwatch on the hand of their preference well ahead of any meal and to continue wearing it throughout the day until the battery was depleted. In addition, we followed a self-report labeling model, meaning that the ground truth is provided by the participants, who documented the start and end moments of their meals to the best of their abilities, as well as the hand on which they wore the smartwatch. The total duration of the 22 recordings sums up to 112.71 hours, with a mean duration of 5.12 hours. Additional dataset statistics can be obtained by executing the provided Python script stats_dataset.py. Furthermore, the accompanying Python script viz_dataset.py will visualize the IMU signals and ground truth intervals for each of the recordings. Information on how to execute the Python scripts can be found below.

    The script(s) and the pickle file must be located in the same directory.

    Tested with Python 3.6.4

    Requirements: Numpy, Pickle and Matplotlib

    Calculate and echo dataset statistics

    $ python stats_dataset.py

    Visualize signals and ground truth

    $ python viz_dataset.py

    FreeFIC is also tightly related to Food Intake Cycle (FIC), a dataset we created in order to investigate the in-meal eating behavior. More information about FIC can be found here and here.

    Publications

    If you plan to use the FreeFIC dataset or any of the resources found in this page, please cite our work:

    @article{kyritsis2020data,
    title={A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches},
    author={Kyritsis, Konstantinos and Diou, Christos and Delopoulos, Anastasios},
    journal={IEEE Journal of Biomedical and Health Informatics}, year={2020},
    publisher={IEEE}}

    @inproceedings{kyritsis2017automated,
    title={Detecting Meals In the Wild Using the Inertial Data of a Typical Smartwatch},
    author={Kyritsis, Konstantinos and Diou, Christos and Delopoulos, Anastasios},
    booktitle={2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)},
    year={2019}, organization={IEEE}}

    Technical details

    We provide the FreeFIC dataset as a pickle. The file can be loaded using Python in the following way:

    import pickle as pkl
    import numpy as np

    with open('./FreeFIC_FreeFIC-heldout.pkl', 'rb') as fh:
        dataset = pkl.load(fh)

    The dataset variable in the snippet above is a dictionary with 5 keys. Namely:

    'subject_id'

    'session_id'

    'signals_raw'

    'signals_proc'

    'meal_gt'

    The contents under a specific key can be obtained by:

    sub = dataset['subject_id']     # for the subject id
    ses = dataset['session_id']     # for the session id
    raw = dataset['signals_raw']    # for the raw IMU signals
    proc = dataset['signals_proc']  # for the processed IMU signals
    gt = dataset['meal_gt']         # for the meal ground truth

    The sub, ses, raw, proc and gt variables in the snippet above are lists with a length equal to 22. Elements across all lists are aligned; e.g., the 3rd element of the list under the 'session_id' key corresponds to the 3rd element of the list under the 'signals_proc' key.

    sub: list Each element of the sub list is a scalar (integer) that corresponds to the unique identifier of the subject, which can take the following values: [1, 2, 3, 4, 13, 14, 15, 16, 17, 18, 19, 20]. It should be emphasized that the subjects with ids 15, 16, 17, 18, 19 and 20 belong to the held-out part of the FreeFIC dataset (more information can be found in the publication titled "A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches" by Kyritsis et al.). Moreover, the subject identifier in FreeFIC is in line with the subject identifier in the FIC dataset (more info here and here); i.e., FIC's subject with id equal to 2 is the same person as FreeFIC's subject with id equal to 2.

    ses: list Each element of this list is a scalar (integer) that corresponds to the unique identifier of the session, which can range between 1 and 5. It should be noted that not all subjects have the same number of sessions.

    raw: list Each element of this list is a dictionary with the 'acc' and 'gyr' keys. The data under the 'acc' key is an N_acc × 4 numpy.ndarray that contains the timestamps in seconds (first column) and the 3D raw accelerometer measurements in g (second, third and fourth columns, representing the x, y and z axes, respectively). The data under the 'gyr' key is an N_gyr × 4 numpy.ndarray that contains the timestamps in seconds (first column) and the 3D raw gyroscope measurements in degrees/second (second, third and fourth columns, representing the x, y and z axes, respectively). All sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is on par with the signals in the FIC dataset (more info here and here). Finally, the lengths of the raw accelerometer and gyroscope numpy.ndarrays differ (N_acc ≠ N_gyr). This behavior is expected and is caused by the Android platform.

    proc: list Each element of this list is an M × 7 numpy.ndarray that contains the timestamps and the 3D accelerometer and gyroscope measurements for each recording. Specifically, the first column contains the timestamps in seconds, the second, third and fourth columns contain the x, y and z accelerometer values in g, and the fifth, sixth and seventh columns contain the x, y and z gyroscope values in degrees/second. Unlike elements in the raw list, processed measurements (in the proc list) have a constant sampling rate of 100 Hz and the accelerometer/gyroscope measurements are aligned with each other. In addition, all sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is on par with the signals in the FIC dataset (more info here and here). No other preprocessing is performed on the data; e.g., the acceleration component due to the Earth's gravitational field is present in the processed acceleration measurements. The potential researcher can consult the article "A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches" by Kyritsis et al. on how to further preprocess the IMU signals (i.e., smooth and remove the gravitational component).

    meal_gt: list Each element of this list is a K × 2 matrix. Each row represents a meal interval for the specific in-the-wild session. The first column contains the timestamps of the meal start moments, whereas the second contains the timestamps of the meal end moments. All timestamps are in seconds. The number of meals K varies across recordings (e.g., recordings exist where a participant consumed two meals).
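
    As an illustration, here is a minimal sketch of isolating the processed signal of one recording during its first meal (this continues the snippet above and assumes, as implied by the description, that the timestamps in the proc arrays and the meal_gt intervals share the same time base in seconds):

    idx = 0                          # first recording
    signal = proc[idx]               # M x 7 array: timestamp, acc x/y/z, gyr x/y/z
    start_t, end_t = gt[idx][0]      # first meal interval of this recording, in seconds

    # Keep only the samples whose timestamps fall inside the meal interval
    mask = (signal[:, 0] >= start_t) & (signal[:, 0] <= end_t)
    meal_signal = signal[mask]

    print('Meal duration (s):', end_t - start_t)
    print('Samples during meal:', meal_signal.shape[0])   # proc is sampled at 100 Hz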

    Ethics and funding

    Informed consent, including permission for third-party access to anonymised data, was obtained from all subjects prior to their engagement in the study. The work has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No 727688 - BigO: Big data against childhood obesity.

    Contact

    Any inquiries regarding the FreeFIC dataset should be addressed to:

    Dr. Konstantinos KYRITSIS

    Multimedia Understanding Group (MUG), Department of Electrical & Computer Engineering, Aristotle University of Thessaloniki, University Campus, Building C, 3rd floor, Thessaloniki, Greece, GR54124

    Tel: +30 2310 996359, 996365 Fax: +30 2310 996398 E-mail: kokirits [at] mug [dot] ee [dot] auth [dot] gr

  6. Dual Simulated SEND and EDX dataset

    • zenodo.org
    zip
    Updated Apr 24, 2024
    Cite
    Andy Bridger (2024). Dual Simulated SEND and EDX dataset [Dataset]. http://doi.org/10.5281/zenodo.11061945
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 24, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andy Bridger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A simulated SEND + EDX dataset along with the code used to produce it.

    • code
      • SEND_ground_Truth_Segment_Model-AB.ipynb (notebook outlining the code for end-to-end data production; some of the actual SEND data production is done through the gen_data.py and add_noise.py files, because system memory requirements make a cluster job more convenient)
      • gen_data.py (python file for creating an intermediate simulated SEND dataset)
      • add_noise.py (python file that takes the intermediate SEND dataset and samples it to produce pseudo-experimental data)
    • phase_maps
      • Contains pairs of jpg/npy files that show/quantify the proportion of each phase at each pixel location as constructed in the atomic model
    • data
      • SEND.hspy (the simulated SEND dataset as per the atomic model)
      • EDS.hspy (the simulated EDS dataset)
      • EDS-varied-dose.zip (EDS simulations at different electron doses)
      • atomic_model.xyz (ASE atomic model for the simulated data)
      • labelled_voxels.npy (the phase labels for the 3d array of volumetric-pixels used to produce the atomic model)
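
    A minimal sketch of loading the simulated signals with HyperSpy (hs.load reads the HDF5-based .hspy files listed above; the printouts are just summaries of the axes and data shapes):

    import hyperspy.api as hs

    # Load the simulated SEND and EDS signals from the data folder
    send = hs.load("SEND.hspy")
    eds = hs.load("EDS.hspy")

    print(send)   # summary of the simulated SEND signal
    print(eds)    # summary of the simulated EDS signal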

    Added in newer version: the VAE processing of the SEND data has been included

    • data
      • RadialData
        • data_radial.hspy (A radial transformation of the simulated SEND dataset used for VAE testing)
        • data_radial_training_data.hspy (The data_radial.hspy dataset but with pattern populations reweighted to better represent high variance regions)
        • navigation_axis_variance.npy (The mean variance within the 2D diffraction signal at each pixel probe position)
        • signal_axis_variance.npy (The variance of each signal pixel in the 2D diffraction pattern, averaged for each pixel probe position)
        • RadialModel
          • best_model.hdf5 (The trained weights for the VAE)
        • PCA_comps_mse
          • N (The number of PCA components used to estimate the centroids of the clustering)
            • latspacedata.npy (The coordinates of the simulated SEND data in the 2d latent space)
            • mapdata.npy (The assigned cluster label to each of the patterns in the simulated SEND data)
            • Regions
              • i.jpg (the region and radial pattern of each of the clusters)
        • ML_clusters_mse
          • encode_data.npy (The coordinates of the simulated SEND data in the 2d latent space)
          • enc_mask.npy (The encoded data transformed into a density based fixed image)
          • ml_cluster_map.npy (The ML predictions of the centroid locations)
          • ML_clusters
            • N (This folder is then the same as the PCA equivalent)
  7. Photocatalysis Ontology - Dataset and RO-Crates packages

    • data.niaid.nih.gov
    Updated Sep 26, 2022
    Cite
    Oier Beaskoetxea Aldazabal (2022). Photocatalysis Ontology - Dataset and RO-Crates packages [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7097811
    Explore at:
    Dataset updated
    Sep 26, 2022
    Dataset authored and provided by
    Oier Beaskoetxea Aldazabal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This package contains the datasets extracted from the Artleafs database, as well as the RO-Crate packages and the RDF dataset generated from them for the project to create RO-Crates using the PHCAT ontology. You can also find the Python scripts used to transform the extracted CSV data into a new RDF dataset, allowing you to create more RO-Crate packages if desired.

    • ./data: contains the set of data extracted from the database in CSV format.

    • ./resources: contains the generated RO-Crate packages as well as the mapping files used and the RDF subsets of each article.

    • ./OutputPhotocatalysisMapping.ttl: the file in Turtle format that stores the global RDF dataset after the translation of the database data.

    The rest of the folders and files contain mapping rules and scripts used in the data transformation process. For more information check the following GitHub repository: https://github.com/oeg-upm/photocatalysis-ontology.
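
    A minimal sketch of inspecting the global RDF file with rdflib (the SPARQL query is a generic example and not part of the project's scripts):

    from rdflib import Graph

    g = Graph()
    g.parse("OutputPhotocatalysisMapping.ttl", format="turtle")

    print("Number of triples:", len(g))

    # List a few of the distinct predicates used in the dataset
    for row in g.query("SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 10"):
        print(row.p)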

  8. Grammar transformations of topographic feature type annotations of the U.S....

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Jul 20, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Grammar transformations of topographic feature type annotations of the U.S. to structured graph data. [Dataset]. https://catalog.data.gov/dataset/grammar-transformations-of-topographic-feature-type-annotations-of-the-u-s-to-structured-g
    Explore at:
    Dataset updated
    Jul 20, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States
    Description

    These data were used to examine grammatical structures and patterns within a set of geospatial glossary definitions. The objectives of our study were to analyze the semantic structure of the input definitions, use this information to build triple structures of RDF graph data, upload our lexicon to a knowledge graph software, and perform SPARQL queries on the data. Upon completion of this study, SPARQL queries were shown to effectively convey graph triples that displayed semantic significance. These data represent and characterize the lexicon of our input text, which is used to form graph triples. These data were collected in 2024 by passing text through multiple Python programs utilizing spaCy (a natural language processing library) and its pre-trained English transformer pipeline. Before the data were processed by the Python programs, input definitions were first rewritten as natural language and formatted as tabular data. Passages were then tokenized and characterized by their part-of-speech, tag, dependency relation, dependency head, and lemma. Each word within the lexicon was tokenized. A stop-words list was utilized only to remove punctuation and symbols from the text, excluding hyphenated words (e.g., bowl-shaped), which remained as such. The tokens' lemmas were then aggregated and totaled to find their recurrences within the lexicon. This procedure was repeated for tokenizing noun chunks using the same glossary definitions.
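
    A minimal sketch of the kind of spaCy processing described above (the sentence is illustrative, and the small English model stands in for the pre-trained transformer pipeline, which would be loaded as "en_core_web_trf"):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("A basin is a bowl-shaped depression in the land surface.")

    # Part-of-speech, tag, dependency relation, dependency head and lemma per token
    for token in doc:
        print(token.text, token.pos_, token.tag_, token.dep_, token.head.text, token.lemma_)

    # Noun chunks, as used in the second tokenization pass
    print([chunk.text for chunk in doc.noun_chunks])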

  9. Data from: EC-MS dataset of electrocatalytic transformations of butane on Pt...

    • zenodo.org
    bin, text/x-python +1
    Updated Jul 25, 2024
    Cite
    Christine Lucky; Shengli Jiang; Chien-Rung Shih; Victor Zavala; Marcel Schreier (2024). EC-MS dataset of electrocatalytic transformations of butane on Pt [Dataset]. http://doi.org/10.5281/zenodo.12801617
    Explore at:
    Available download formats: tsv, bin, text/x-python
    Dataset updated
    Jul 25, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christine Lucky; Shengli Jiang; Chien-Rung Shih; Victor Zavala; Marcel Schreier
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This distribution provides the code and reference data for the manuscript "Understanding the interplay between electrocatalytic C(sp3)‒C(sp3) fragmentation and oxygenation reactions".

    The code can be executed using Python version 3.8.

    Data

    The distribution includes an .xlsx file with reference mass spectra data and experimental data in .tsv format.

    Usage

    1. Ensure Python 3.8 and Jupyter Notebook are installed.
    2. Execute each cell in the notebook example.ipynb. A pop-up window will prompt you to upload your experimental mass spectra data; select your file accordingly. The cells are organized as:
      1. Load Data: This step loads the reference spectra data.
      2. Preprocess Data: This step removes background signals and smooths the signal.
      3. Optimization: This step uses constrained least-squares optimization to reconstruct spectra and predict flux (a sketch of this idea follows the list).
      4. Plot: This step displays the spectra reconstruction and flux prediction.
      5. File Output: This step saves the predicted flux and reconstructed spectra to files.
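
    A minimal sketch of the constrained least-squares idea behind the optimization step, using made-up reference spectra (the actual notebook example.ipynb operates on the distributed .xlsx/.tsv files; nothing here reproduces its variable names or data):

    import numpy as np
    from scipy.optimize import nnls

    # Columns of R are reference mass spectra (one per species); b is a measured spectrum.
    R = np.array([[0.9, 0.1],
                  [0.1, 0.7],
                  [0.0, 0.2]])      # 3 m/z channels x 2 reference species (toy values)
    b = np.array([0.5, 0.4, 0.1])   # measured spectrum (toy values)

    # Non-negative least squares: find fluxes x >= 0 minimizing ||R x - b||
    x, residual = nnls(R, b)
    print("Estimated fluxes:", x)
    print("Reconstructed spectrum:", R @ x)
    print("Residual norm:", residual)
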
  10. Text from pdfs found on data.gouv.fr

    • gimi9.com
    + more versions
    Cite
    Text from pdfs found on data.gouv.fr [Dataset]. https://gimi9.com/dataset/eu_5ec45f516a58eec727e79af7/
    Explore at:
    Area covered
    France
    Description

    Text extracted from pdfs found on data.gouv.fr

    Description

    This dataset contains text extracted from 6602 files that have the 'pdf' extension in the resource catalog of data.gouv.fr. The dataset contains only the pdfs of 20 MB or less which are still available at the indicated URL. The extraction was done with PDFBox via its Python wrapper python-PDFBox. PDFs that are images (scans, maps, etc.) are detected with a simple heuristic: if, after converting to text with PDFBox, the file size is less than 20 bytes, the file is considered to be an image. In this case, OCR is carried out with Tesseract via its Python wrapper pyocr. The result is 'txt' files from the 'pdfs', sorted by organisation (the organisation that published the resource). There are 175 organisations in this dataset, and therefore 175 folders. The name of each file corresponds to the string '{id-du-dataset}--{id-de-la-resource}.txt'.

    Input: the catalogue of data.gouv.fr resources.

    Output: text files for each 'pdf' resource found in the catalogue that was successfully converted and satisfied the above constraints. The tree is as follows:

    .
    |-- ACTION_Nogent-sur-Marne
    |   |-- 53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt
    |   |-- 53ba55c4a3a729219b7beae2--1ffcb2cb-2355-4426-b74a-946dadeba7f1.txt
    |   |-- 53ba55c4a3a729219b7beae2--297a0466-daaa-47f4-972a-0d5bea2ab180.txt
    |   |-- 53ba55c4a3a729219b7beae2--3ac0a881-181f-499e-8b3f-c2b0ddd528f7.txt
    |   |-- 53ba55c4a3a729219b7beae2--3ca6bd8f-05a6-469a-a36b-afda5a7444a4.txt
    |   |-- ...
    |-- Aeroport_La_Rochelle-Ile_de_Re
    |-- Agency_de_services_and_payment_ASP
    |-- Agency_du_Numerique
    |-- ...

    Distribution of texts (as of 20 May 2020)

    The top 10 organisations with the largest number of documents are:

    [('Les_Lilas', 1294), ('Ville_de_Pirae', 1099), ('Region_Hauts-de-France', 592), ('Ressourcerie_datalocale', 297), ('NA', 268), ('CORBION', 244), ('Education_Nationale', 189), ('Incubator_of_Services_Numeriques', 157), ('Ministere_des_Solidarites_and_de_la_Sante', 148), ('Communaute_dAgglomeration_Plaine_Vallee', 142)]

    A 2D preview of the texts is obtained with HashFeatures + TruncatedSVD + t-SNE. [Figure: t-SNE plot of the data.gouv.fr texts]

    Code

    The Python scripts used to do this extraction are here.

    Remarks

    Due to the quality of the original pdfs (low-resolution scans, non-aligned pdfs, ...) and the performance of the pdf-to-txt transformation methods, the results can be very noisy.
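
    A minimal sketch of the image-detection heuristic described above; only the size-based check comes from the description, and the conversion/OCR functions are caller-supplied placeholders (e.g. wrappers around python-PDFBox and pyocr/Tesseract):

    import os

    def extract_text(pdf_path, txt_path, pdf_to_text, ocr_pdf):
        """pdf_to_text and ocr_pdf are caller-supplied functions (hypothetical wrappers)."""
        pdf_to_text(pdf_path, txt_path)
        # Heuristic from the description: a near-empty text file means the PDF is a scanned image
        if os.path.getsize(txt_path) < 20:
            with open(txt_path, "w", encoding="utf-8") as fh:
                fh.write(ocr_pdf(pdf_path))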

  11. Market Basket Analysis

    • kaggle.com
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 9, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions on the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data; it records all the transactions that took place over a period of time. The retailer will use the results to grow its business: by providing customers with suggestions on itemsets, we can increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using association rules, a type of unsupervised learning technique that checks for the dependency of one data item on another.

    Introduction

    Association rule mining is most useful when you want to discover associations between different objects in a set. It works well for finding frequent patterns in a transaction database: it can tell you which items customers frequently buy together, and it allows the retailer to identify relationships between items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": support = P(mouse and mat) = 8/100 = 0.08; confidence = support / P(computer mouse) = 0.08/0.10 = 0.8; lift = confidence / P(mouse mat) = 0.8/0.09 ≈ 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
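
    For reference, the same calculations in a few lines of Python (the write-up itself uses R; the numbers are the ones from the example above):

    n_customers = 100
    n_mouse = 10    # bought a computer mouse
    n_mat = 9       # bought a mouse mat
    n_both = 8      # bought both

    support = n_both / n_customers                   # P(mouse and mat) = 0.08
    confidence = support / (n_mouse / n_customers)   # 0.08 / 0.10 = 0.8
    lift = confidence / (n_mat / n_customers)        # 0.8 / 0.09 ≈ 8.9

    print(support, confidence, lift)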

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: . xlsx
    • Number of Row: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.


    Libraries in R

    First, we need to load required libraries. Shortly I describe all libraries.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
    • tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.


    Data Pre-processing

    Next, we need to upload Assignment-1_Data.xlsx to R to read the dataset. Now we can see our data in R.


    After that, we clean our data frame by removing missing values.


    To apply association rule mining, we need to convert the data frame into transaction data so that all items bought together in one invoice are in ...

  12. Data from: T1DiabetesGranada: a longitudinal multi-modal dataset of type 1...

    • produccioncientifica.ugr.es
    • data.niaid.nih.gov
    Updated 2023
    Cite
    Rodriguez-Leon, Ciro; Aviles Perez, Maria Dolores; Banos, Oresti; Quesada-Charneco, Miguel; Lopez-Ibarra, Pablo J; Villalonga, Claudia; Munoz-Torres, Manuel (2023). T1DiabetesGranada: a longitudinal multi-modal dataset of type 1 diabetes mellitus [Dataset]. https://produccioncientifica.ugr.es/documentos/668fc429b9e7c03b01bd53b7
    Explore at:
    Dataset updated
    2023
    Authors
    Rodriguez-Leon, Ciro; Aviles Perez, Maria Dolores; Banos, Oresti; Quesada-Charneco, Miguel; Lopez-Ibarra, Pablo J; Villalonga, Claudia; Munoz-Torres, Manuel
    Description

    T1DiabetesGranada

    A longitudinal multi-modal dataset of type 1 diabetes mellitus

    Documented by:

    Rodriguez-Leon, C., Aviles-Perez, M. D., Banos, O., Quesada-Charneco, M., Lopez-Ibarra, P. J., Villalonga, C., & Munoz-Torres, M. (2023). T1DiabetesGranada: a longitudinal multi-modal dataset of type 1 diabetes mellitus. Scientific Data, 10(1), 916. https://doi.org/10.1038/s41597-023-02737-4

    Background

    Type 1 diabetes mellitus (T1D) patients face daily difficulties in keeping their blood glucose levels within appropriate ranges. Several techniques and devices, such as flash glucose meters, have been developed to help T1D patients improve their quality of life. Most recently, the data collected via these devices is being used to train advanced artificial intelligence models to characterize the evolution of the disease and support its management. The main problem for the generation of these models is the scarcity of data, as most published works use private or artificially generated datasets. For this reason, this work presents T1DiabetesGranada, an open (available under specific permission) longitudinal dataset that not only provides continuous glucose levels, but also patient demographic and clinical information. The dataset includes 257780 days of measurements over four years from 736 T1D patients from the province of Granada, Spain. This dataset progresses significantly beyond the state of the art as one of the longest and largest open datasets of continuous glucose measurements, thus boosting the development of new artificial intelligence models for glucose level characterization and prediction.

    Data Records

    The data are stored in four comma-separated values (CSV) files which are available in T1DiabetesGranada.zip. These files are described in detail below.

    Patient_info.csv

    Patient_info.csv is the file containing information about the patients, such as demographic data, start and end dates of blood glucose level measurements and biochemical parameters, number of biochemical parameters or number of diagnostics. This file is composed of 736 records, one for each patient in the dataset, and includes the following variables:

    Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.

    Sex – Sex of the patient. Values: F (for female), M (for male).

    Birth_year – Year of birth of the patient. Format: YYYY.

    Initial_measurement_date – Date of the first blood glucose level measurement of the patient in the Glucose_measurements.csv file. Format: YYYY-MM-DD.

    Final_measurement_date – Date of the last blood glucose level measurement of the patient in the Glucose_measurements.csv file. Format: YYYY-MM-DD.

    Number_of_days_with_measures – Number of days with blood glucose level measurements of the patient, extracted from the Glucose_measurements.csv file. Values: ranging from 8 to 1463.

    Number_of_measurements – Number of blood glucose level measurements of the patient, extracted from the Glucose_measurements.csv file. Values: ranging from 400 to 137292.

    Initial_biochemical_parameters_date – Date of the first biochemical test to measure some biochemical parameter of the patient, extracted from the Biochemical_parameters.csv file. Format: YYYY-MM-DD.

    Final_biochemical_parameters_date – Date of the last biochemical test to measure some biochemical parameter of the patient, extracted from the Biochemical_parameters.csv file. Format: YYYY-MM-DD.

    Number_of_biochemical_parameters – Number of biochemical parameters measured on the patient, extracted from the Biochemical_parameters.csv file. Values: ranging from 4 to 846.

    Number_of_diagnostics – Number of diagnoses made for the patient, extracted from the Diagnostics.csv file. Values: ranging from 1 to 24.

    Glucose_measurements.csv

    Glucose_measurements.csv is the file containing the continuous blood glucose level measurements of the patients. The file is composed of more than 22.6 million records that constitute the time series of continuous blood glucose level measurements. It includes the following variables:

    Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.

    Measurement_date – Date of the blood glucose level measurement. Format: YYYY-MM-DD.

    Measurement_time – Time of the blood glucose level measurement. Format: HH:MM:SS.

    Measurement – Value of the blood glucose level measurement in mg/dL. Values: ranging from 40 to 500.

    Biochemical_parameters.csv

    Biochemical_parameters.csv is the file containing data of the biochemical tests performed on patients to measure their biochemical parameters. This file is composed of 87482 records and includes the following variables:

    Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.

    Reception_date – Date of receipt in the laboratory of the sample to measure the biochemical parameter. Format: YYYY-MM-DD.

    Name – Name of the measured biochemical parameter. Values: 'Potassium', 'HDL cholesterol', 'Gammaglutamyl Transferase (GGT)', 'Creatinine', 'Glucose', 'Uric acid', 'Triglycerides', 'Alanine transaminase (GPT)', 'Chlorine', 'Thyrotropin (TSH)', 'Sodium', 'Glycated hemoglobin (Ac)', 'Total cholesterol', 'Albumin (urine)', 'Creatinine (urine)', 'Insulin', 'IA ANTIBODIES'.

    Value – Value of the biochemical parameter. Values: ranging from -4.0 to 6446.74.

    Diagnostics.csv

    Diagnostics.csv is the file containing diagnoses of diabetes mellitus complications or other diseases that patients have in addition to type 1 diabetes mellitus. This file is composed of 1757 records and includes the following variables:

    Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.

    Code – ICD-9-CM diagnosis code. Values: subset of 594 of the ICD-9-CM codes (https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes).

    Description – ICD-9-CM long description. Values: subset of 594 of the ICD-9-CM long description (https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes).
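
    A minimal sketch of loading the measurements with pandas once T1DiabetesGranada.zip has been extracted (column names follow the descriptions above):

    import pandas as pd

    glucose = pd.read_csv("Glucose_measurements.csv", parse_dates=["Measurement_date"])

    # Days with measurements per patient, comparable to Number_of_days_with_measures
    days_per_patient = glucose.groupby("Patient_ID")["Measurement_date"].nunique()
    print(days_per_patient.describe())

    # Distribution of the glucose values in mg/dL (documented range: 40 to 500)
    print(glucose["Measurement"].describe())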

    Technical Validation

    Blood glucose level measurements are collected using FreeStyle Libre devices, which are widely used for healthcare in patients with T1D. Abbott Diabetes Care, Inc., Alameda, CA, USA, the manufacturer company, has conducted validation studies of these devices concluding that the measurements made by their sensors compare to YSI analyzer devices (Xylem Inc.), the gold standard, yielding results of 99.9% of the time within zones A and B of the consensus error grid. In addition, other studies external to the company concluded that the accuracy of the measurements is adequate.

    Moreover, it was also checked that, in most cases, the blood glucose level measurements per patient were continuous (i.e., a sample at least every 15 minutes) in the Glucose_measurements.csv file, as they should be.

    Usage Notes

    For data downloading, it is necessary to be authenticated on the Zenodo platform, accept the Data Usage Agreement and send a request specifying full name, email, and the justification of the data use. This request will be processed by the Secretary of the Department of Computer Engineering, Automatics, and Robotics of the University of Granada and access to the dataset will be granted.

    The files that compose the dataset are CSV type files delimited by commas and are available in T1DiabetesGranada.zip. A Jupyter Notebook (Python v. 3.8) with code that may help to a better understanding of the dataset, with graphics and statistics, is available in UsageNotes.zip.

    Graphs_and_stats.ipynb

    The Jupyter Notebook generates tables, graphs and statistics for a better understanding of the dataset. It has four main sections, one dedicated to each file in the dataset. In addition, it has useful functions such as calculating a patient's age, deleting a list of patients from a dataset file, and keeping only a list of patients in a dataset file.

    Code Availability

    The dataset was generated using some custom code located in CodeAvailability.zip. The code is provided as Jupyter Notebooks created with Python v. 3.8. The code was used to conduct tasks such as data curation and transformation, and variables extraction.

    Original_patient_info_curation.ipynb

    This Jupyter Notebook preprocesses the original file with patient data. Mainly, irrelevant rows and columns are removed and the sex variable is recoded.

    Glucose_measurements_curation.ipynb

    This Jupyter Notebook preprocesses the original file with the continuous glucose level measurements of the patients. Principally, rows without information or duplicated rows are removed, and the variable with the timestamp is split into two new variables, measurement date and measurement time.

    Biochemical_parameters_curation.ipynb

    This Jupyter Notebook preprocesses the original file with data from the biochemical tests performed on patients to measure their biochemical parameters. Mainly, irrelevant rows and columns are removed and the variable with the name of the measured biochemical parameter is translated.

    Diagnostic_curation.ipynb

    This Jupyter Notebook preprocesses the original file with data on the diagnoses of diabetes mellitus complications or other diseases that patients have in addition to T1D.

    Get_patient_info_variables.ipynb

    This Jupyter Notebook implements the feature extraction process from the files Glucose_measurements.csv, Biochemical_parameters.csv and Diagnostics.csv to complete the file Patient_info.csv. It is divided into six sections: the first three extract the features from each of the mentioned files, and the next three add the extracted features to the resulting new file.

    Data Usage Agreement

    The conditions for use are as follows:

    You confirm that you will not attempt to re-identify research participants for any reason, including for re-identification theory research.

    You commit to keeping the T1DiabetesGranada dataset confidential and secure and will not redistribute data or Zenodo account credentials.

    You will require

  13. Enhancing UNCDF Operations: Power BI Dashboard Development and Data Mapping

    • figshare.com
    Updated Jan 6, 2025
    Cite
    Maryam Binti Haji Abdul Halim (2025). Enhancing UNCDF Operations: Power BI Dashboard Development and Data Mapping [Dataset]. http://doi.org/10.6084/m9.figshare.28147451.v1
    Explore at:
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    figshare
    Authors
    Maryam Binti Haji Abdul Halim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This project focuses on data mapping, integration, and analysis to support the development and enhancement of six UNCDF operational applications: OrgTraveler, Comms Central, Internal Support Hub, Partnership 360, SmartHR, and TimeTrack. These apps streamline workflows for travel claims, internal support, partnership management, and time tracking within UNCDF.

    Key features and tools:

    • Data mapping for Salesforce CRM migration: structured and mapped data flows to ensure compatibility and seamless migration to Salesforce CRM.
    • Python for data cleaning and transformation: utilized pandas, numpy, and APIs to clean, preprocess, and transform raw datasets into standardized formats.
    • Power BI dashboards: designed interactive dashboards to visualize workflows and monitor performance metrics for decision-making.
    • Collaboration across platforms: integrated Google Colab for code collaboration and Microsoft Excel for data validation and analysis.

  14. Taylor Swift | The Eras Tour Official Setlist Data

    • kaggle.com
    Updated May 13, 2024
    Cite
    yuka_with_data (2024). Taylor Swift | The Eras Tour Official Setlist Data [Dataset]. https://www.kaggle.com/datasets/yukawithdata/taylor-swift-the-eras-tour-official-setlist-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 13, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    yuka_with_data
    Description

    💁‍♀️Please take a moment to carefully read through this description and metadata to better understand the dataset and its nuances before proceeding to the Suggestions and Discussions section.

    Dataset Description:

    This dataset provides a comprehensive collection of setlists from Taylor Swift’s official era tours, curated expertly by Spotify. The playlist, available on Spotify under the title "Taylor Swift The Eras Tour Official Setlist," encompasses a diverse range of songs that have been performed live during the tour events of this global artist. Each dataset entry corresponds to a song featured in the playlist.

    Taylor Swift, a pivotal figure in both country and pop music scenes, has had a transformative impact on the music industry. Her tours are celebrated not just for their musical variety but also for their theatrical elements, narrative style, and the deep emotional connection they foster with fans worldwide. This dataset aims to provide fans and researchers an insight into the evolution of Swift's musical and performance style through her tours, capturing the essence of what makes her tour unique.

    Data Collection and Processing:

    Obtaining the Data: The data was obtained directly from the Spotify Web API, specifically focusing on the setlist tracks by the artist. The Spotify API provides detailed information about tracks, artists, and albums through various endpoints.

    Data Processing: To process and structure the data, Python scripts were developed using data science libraries such as pandas for data manipulation and spotipy for API interactions, specifically for Spotify data retrieval.

    Workflow:

    1. Authentication
    2. API Requests
    3. Data Cleaning and Transformation
    4. Saving the Data
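
    A minimal sketch of this workflow with spotipy (the playlist ID is a placeholder, and credentials are assumed to be available through the standard SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET environment variables):

    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    # Authentication
    sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

    # API requests: playlist tracks, then audio features for each track
    playlist_id = "PLAYLIST_ID"   # placeholder for the official setlist playlist
    items = sp.playlist_items(playlist_id)["items"]
    track_ids = [item["track"]["id"] for item in items]
    features = sp.audio_features(track_ids)

    # Data cleaning and transformation would follow, e.g. building a pandas DataFrame
    for item, feat in zip(items, features):
        print(item["track"]["name"], feat["danceability"], feat["valence"])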

    Attribute Descriptions:

    • artist_name: the name of the artist (Taylor Swift)
    • track_name: the title of the track
    • is_explicit: Indicates whether the track contains explicit content
    • album_release_date: The date when the track was released
    • genres: A list of genres associated with the artist
    • danceability: A measure from 0.0 to 1.0 indicating how suitable a track is for dancing based on a combination of musical elements
    • valence: A measure from 0.0 to 1.0 indicating the musical positiveness conveyed by a track
    • energy: A measure from 0.0 to 1.0 representing a perceptual measure of intensity and activity
    • loudness: The overall loudness of a track in decibels (dB)
    • acousticness: A measure from 0.0 to 1.0 whether the track is acoustic
    • instrumentalness: Predicts whether a track contains no vocals
    • liveness: Detects the presence of an audience in the recording
    • speechiness: Detects the presence of spoken words in a track
    • key: The key the track is in. Integers map to pitches using standard Pitch Class notation
    • tempo: The overall estimated tempo of a track in beats per minute (BPM)
    • mode: Modality of the track
    • duration_ms: The length of the track in milliseconds
    • time_signature: An estimated overall time signature of a track
    • popularity: A score between 0 and 100, with 100 being the most popular

    Note: Popularity score reflects the score recorded on the day that retrieves this dataset. The popularity score could fluctuate daily.

    Potential Applications:

    • Predictive Analytics: Researchers might use this dataset to predict future setlist choices for tours based on album success, song popularity, and fan feedback.

    Disclaimer and Responsible Use:

    This dataset, derived from Spotify focusing on Taylor Swift's The Eras Tour setlist data, is intended for educational, research, and analysis purposes only. Users are urged to use this data responsibly, ethically, and within the bounds of legal stipulations.

    • Compliance with Terms of Service: Users should adhere to Spotify's Terms of Service and Developer Policies when utilizing this dataset.
    • Copyright Notice: The dataset presents music track information including names and artist details for analytical purposes and does not convey any rights to the music itself. Users must ensure that their use does not infringe on the copyright holders' rights. Any analysis, distribution, or derivative work should respect the intellectual property rights of all involved parties and comply with applicable laws.
    • No Warranty Disclaimer: The dataset is provided "as is," without warranty, and the creator disclaims any legal liability for its use by others.
    • Ethical Use: Users are encouraged to consider the ethical implications of their analyses and the potential impact on artists and the broader community.
    • Data Accuracy and Timeliness: The dataset reflects a snapshot in time and may not represent the most current information available. Users are encouraged to verify the data's accuracy and timeliness.
    • Source Verification: For the most accurate and up-to-date information, users are encouraged to refer directly to Spotify's official website.
    • Independence Declaration: ...
  15. Data from: Positive effects of public breeding on U.S. rice yields under...

    • data.niaid.nih.gov
    Updated Sep 14, 2023
    Cite
    Diane R. Wang (2023). Positive effects of public breeding on U.S. rice yields under future climate scenarios [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8040082
    Explore at:
    Dataset updated
    Sep 14, 2023
    Dataset provided by
    Jeremy D. Edwards
    Rongkui Han
    Sajad Jamshidi
    Diane R. Wang
    Susan R. McCouch
    Anna McClung
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    This data repository offers comprehensive resources, including datasets, Python scripts, and models associated with the study entitled, "Positive effects of public breeding on U.S. rice yields under future climate scenarios". The repository contains three models: a PCA model for data transformation, along with two meta-machine learning models for predictive analysis. Additionally, three Python scripts are available to facilitate the creation of training datasets and machine-learning models. The repository also provides tabulated weather, genetic, and county-level rice yield information specific to the southern U.S. region, which serves as the primary data inputs for our research. The focus of our study lies in modeling and predicting rice yields, incorporating factors such as molecular marker variation, varietal productivity, and climate, particularly within the Southern U.S. rice growing region. This region encompasses Arkansas, Louisiana, Texas, Mississippi, and Missouri, which collectively account for 85% of total U.S. rice production. By digitizing and merging county-level variety acreage data from 1970 to 2015 with genotyping-by-sequencing data, we estimate annual county-level allele frequencies. These frequencies, in conjunction with county-level weather and yield data, are employed to develop ten machine-learning models for yield prediction. An ensemble model, consisting of a two-layer meta-learner, combines the predictions of all ten models and undergoes external evaluation using historical Uniform Regional Rice Nursery trials (1980-2018) conducted within the same states. Lastly, the ensemble model, coupled with forecasted weather data from the Coupled Model Intercomparison Project, is employed to predict future production across the 110 rice-growing counties, considering various groups of germplasm.
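
    As a purely illustrative sketch (not the authors' code), a two-layer meta-learner of the kind described above can be assembled in scikit-learn with a PCA transformation followed by a stacking ensemble; the features and targets below are random placeholders:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestRegressor, StackingRegressor
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 30))           # placeholder features (e.g. weather + allele frequencies)
    y = 2 * X[:, 0] + rng.normal(size=200)   # placeholder yield values

    model = make_pipeline(
        PCA(n_components=10),                # PCA model for data transformation
        StackingRegressor(                   # two-layer ensemble: base learners + meta-learner
            estimators=[("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
                        ("ridge", Ridge())],
            final_estimator=Ridge(),
        ),
    )
    model.fit(X, y)
    print(model.predict(X[:5]))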

    This study was supported by USDA NIFA 2014-67003-21858 and USDA NIFA 2022-67013-36205.

  16. Data from: A Dataset of Contributor Activities in the NumFocus Open-Source...

    • zenodo.org
    json, zip
    Updated Jan 16, 2025
    Cite
    Youness Hourri; Alexandre Decan; Tom Mens (2025). A Dataset of Contributor Activities in the NumFocus Open-Source Community [Dataset]. http://doi.org/10.5281/zenodo.14230406
    Explore at:
    Available download formats: zip, json
    Dataset updated
    Jan 16, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Youness Hourri; Alexandre Decan; Tom Mens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The NumFocus dataset provides a comprehensive representation of contributor activity across 58 open-source projects supported by the NumFocus organization. Spanning a three-year observation period (January 2022 to December 2024), this dataset captures the dynamics of open-source collaboration within a defined community of scientific and data-driven software projects.

    To address the challenges of interpreting raw GitHub event logs, the dataset introduces two structured levels of abstraction: actions and activities. Actions offer a detailed view of individual operations, such as creating branches or pushing commits, while activities aggregate related actions into high-level tasks, such as merging pull requests or resolving issues. This hierarchy bridges the gap between granular operations and contributors’ broader intentions.

    The primary dataset focuses on activities, providing a high-level overview of contributor behavior. For users requiring more granular analysis, a complementary dataset of actions is also included.

    The dataset is accompanied by a Python-based command-line tool that automates the transformation of raw GitHub event logs into structured actions and activities. The tool, along with its configurable mapping files and scripts, is publicly available at https://github.com/uhourri/ghmap.

    The dataset is distributed across the following files:

    1. NumFocus_Jan22-Dec24_GH_Actions.zip: Contains 2,716,910 actions in JSON Lines format, capturing individual contributor operations.
    2. NumFocus_Jan22-Dec24_GH_Activities.zip: Contains 2,278,299 activities in JSON Lines format, representing high-level tasks derived from grouped actions.
    3. action_schema.json: A validation schema in JSON format to ensure consistency in interpreting the actions dataset.
    4. activity_schema.json: A validation schema in JSON format to ensure consistency in interpreting the activities dataset.

    Actions: Low-Level Operations

    Actions represent the most granular level of recorded operations, derived from raw GitHub events. Each action corresponds to a single, well-defined contributor operation, such as pushing code to a repository, opening a pull request, or commenting on an issue. This level preserves the technical details necessary for tracing individual operations while standardizing event data to facilitate analysis across repositories.
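
    For example, the actions file can be read directly from the archive and individual records checked against the published schema. The sketch below is illustrative: it assumes the zip contains a single JSON Lines member, which may differ in practice.

    # Minimal sketch: load action records (JSON Lines) from the zip archive and
    # validate one record against action_schema.json. The inner file name is assumed.
    import json
    import zipfile
    from jsonschema import validate

    with zipfile.ZipFile("NumFocus_Jan22-Dec24_GH_Actions.zip") as zf:
        inner = zf.namelist()[0]                      # assume a single JSON Lines member
        with zf.open(inner) as f:
            actions = [json.loads(line) for line in f if line.strip()]

    with open("action_schema.json") as f:
        schema = json.load(f)

    validate(instance=actions[0], schema=schema)       # raises ValidationError on mismatch
    print(len(actions), "actions loaded; first type:", actions[0]["action"])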

    Each action record captures a single contributor operation and includes the following attributes:

    • action: Specifies the type of operation (e.g., PushCommits, OpenPullRequest, or CreateBranch).
    • event_id: A unique identifier linking the action to its originating GitHub event.
    • date: The timestamp of the action, recorded in ISO 8601 format.
    • actor: Contains details about the contributor performing the action, including a persistent id and their GitHub login.
    • repository: Provides information about the repository where the action occurred, including its id, name, and associated organisation.
    • details: Stores additional attributes specific to the action type, extracted from the payload of the corresponding GitHub event (e.g., for a PushCommits action, the details include the branch reference and the number of commits; for an OpenPullRequest action, the details include the pull request’s title, labels, state, and creation and update dates).

    The dataset encompasses 24 distinct action types, each derived from specific GitHub events and representing a well-defined contributor operation:

    1. AddMember: Tracks the addition of a new collaborator to a repository.
    2. CloseIssue: Indicates that an issue has been marked as closed by a contributor.
    3. ClosePullRequest: Represents the closure of a pull request without merging its changes.
    4. CommentCommit: Captures comments made directly on specific commits within a repository.
    5. CreateBranch: Logs the creation of a new branch within a repository.
    6. CreateIssueComment: Tracks comments added to existing issues.
    7. CreatePullRequestComment: Records comments made on pull requests, including discussions on the changes proposed.
    8. CreatePullRequestReview: Represents the submission of a review for a pull request.
    9. CreatePullRequestReviewComment: Captures inline comments added during a pull request review process.
    10. CreateRepository: Represents the creation of a new GitHub repository.
    11. CreateTag: Logs the creation of a tag, often associated with versioning or releases.
    12. DeleteBranch: Indicates that an existing branch has been deleted from a repository.
    13. DeleteTag: Tracks the deletion of a tag within a repository.
    14. ForkRepository: Captures the action of forking a repository to create a copy under a different account.
    15. MakeRepositoryPublic: Represents the change of a private repository’s visibility to public.
    16. ManageWikiPage: Logs edits or updates made to a repository’s wiki pages.
    17. MergePullRequest: Indicates that a pull request has been merged, integrating its changes into the base branch.
    18. OpenIssue: Captures the creation of a new issue within a repository.
    19. OpenPullRequest: Represents the initiation of a new pull request to propose changes.
    20. PublishRelease: Tracks the publication of a release, often tied to specific tags and associated metadata.
    21. PushCommits: Records push events, detailing branches and commits included in the operation.
    22. ReopenIssue: Indicates that a previously closed issue has been reopened for further action.
    23. ReopenPullRequest: Captures the reopening of a previously closed pull request.
    24. StarRepository: Tracks when a user stars a repository to bookmark it or show support.

    Example of an action record:

    {
      "action":"CloseIssue",
      "event_id":"26170139709",
      "date":"2023-01-01T20:19:58Z",
      "actor":{
       "id":1282691,
       "login":"KristofferC"
      },
      "repository":{
       "id":1644196,
       "name":"JuliaLang/julia",
       "organisation":"JuliaLang",
       "organisation_id":743164
      },
      "details":{
       "issue":{
         "id":1515182791,
         "number":48062,
         "title":"Bad default number of BLAS threads on 1.9?",
         "state":"closed",
         "author":{
          "id":1282691,
          "login":"KristofferC"
         },
         "labels":[
          {
            "name":"linear algebra",
            "description":"Linear algebra"
          }
         ],
         "created_date":"2022-12-31T18:49:47Z",
         "updated_date":"2023-01-01T20:19:58Z",
         "closed_date":"2023-01-01T20:19:57Z"
       }
      }
    }

    Activities: High-Level Intent Representation

    To provide a more meaningful abstraction, actions are grouped into activities. Activities represent cohesive, high-level tasks performed by contributors, such as merging a pull request, publishing a release, or resolving an issue. This higher-level grouping removes noise from low-level event logs and aligns with the contributor's intent.

    Activities are constructed based on logical and temporal criteria. For example, merging a pull request may involve several distinct actions: closing the pull request, pushing the merged changes, and deleting the source branch. By aggregating these actions, the activity more accurately reflects the contributor’s intent.
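
    A simplified version of this logical-and-temporal grouping is sketched below. It is illustrative only: the released ghmap tool applies configurable, action-type-aware mapping rules, whereas this sketch uses a fixed time window and groups solely by actor and repository.

    # Minimal sketch: group consecutive actions by the same actor in the same repository
    # into one candidate activity when they occur within a short time window.
    from datetime import datetime, timedelta

    WINDOW = timedelta(seconds=30)      # illustrative window, not the tool's actual rule

    def parse_ts(action):
        # Action dates are ISO 8601 with a trailing 'Z' (UTC).
        return datetime.fromisoformat(action["date"].replace("Z", "+00:00"))

    def group_into_activities(actions):
        activities, current = [], []
        for action in sorted(actions, key=parse_ts):
            if current:
                same_actor = action["actor"]["id"] == current[-1]["actor"]["id"]
                same_repo = action["repository"]["id"] == current[-1]["repository"]["id"]
                close_in_time = parse_ts(action) - parse_ts(current[-1]) <= WINDOW
                if not (same_actor and same_repo and close_in_time):
                    activities.append(current)
                    current = []
            current.append(action)
        if current:
            activities.append(current)
        return activities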

    Each activity record represents a cohesive, high-level task and includes the following attributes:

    • activity: Specifies the type of activity (e.g., MergePullRequest, ReviewPullRequest, or PushCommits).
    • start_date: Indicates when the activity began, recorded in ISO 8601 format.
    • end_date: Indicates when the activity concluded, recorded in ISO 8601 format.
    • actor: Contains details about the contributor performing the activity, including a persistent id and their GitHub login.
    • repository: Provides details about the repository where the activity occurred, including its id, name, and associated organisation.
    • actions: A list of the actions that constitute the activity, retaining their original metadata for traceability.

    The dataset includes 21 distinct activity types, which aggregate related actions based on logical and temporal criteria to represent contributors’ high-level intent:

    1. AddContributors: Tracks the addition of one or more contributors to a repository within a short timeframe.
    2. CloseIssue: Represents the resolution of an issue, optionally accompanied by a comment clarifying the closure.
    3. ClosePullRequest: Indicates the closure of a pull request without merging its changes, optionally documented with a comment.
    4. CommentCommits: Logs comments made directly on specific commits, often as part of discussions or reviews.
    5. CommentIssue: Captures multiple comments on a specific

  17. Pyrho Validation - Check re-gridded periodic data

    • figshare.com
    bin
    Updated May 27, 2022
    Cite
    Jimmy Shen (2022). Pyrho Validation - Check re-gridded periodic data [Dataset]. http://doi.org/10.6084/m9.figshare.19908193.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    May 27, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jimmy Shen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Validation data for "A representation-independent electronic charge density database for crystalline materials"

    Each directory is named after the task_id field from the following query to the Materials Project database:

    from pymatgen.ext.matproj import MPRester

    # Query the (legacy) Materials Project API for elemental structures.
    # Note: "e_above_hull" appears twice in the criteria dict, so the second
    # entry ({"$lt": 0.00001}) overrides the first in the Python dict literal.
    with MPRester() as m:
      q_res = m.query(
        criteria={
          "nelements": 1,
          "e_above_hull": {"$lt": 0.1},
          "nsites": {"$lt": 20},
          "e_above_hull": {"$lt": 0.00001},
        },
        properties=["energy", "structure", "e_above_hull", "task_id", "exp"],
      )
    

    There are 117 directories in all. Each directory contains the POSCAR file of the unit cell and the CHGCARs of the unit cell and of two different supercells:

    # uc is the unit-cell structure; multiplying by a 3x3 integer matrix
    # builds the corresponding supercell (pymatgen's Structure.__mul__).
    sc1 = uc * [
      [1, 1, 0],
      [1, -1, 0],
      [0, 0, 1],
    ]
    sc2 = uc * [
      [2, 0, 0],
      [0, 2, 0],
      [0, 0, 2],
    ]
    

    and the output of the validation analysis, validate_sc.json, which should look like this:

    {
     "sc1": {
      "1": 0.0002730357775883091,
      "2": 6.913892646285771e-05,
      "4": 1.7710165019594026e-05
     },
     "sc2": {
      "1": 0.0002667279377434944,
      "2": 6.911585183768033e-05,
      "4": 2.6712034073784627e-05
     },
     "formula": "H2"
    }
    

    The output of the validation analysis is created using the validate_sc.py script, which calculates the average difference between the re-gridded and explicitly calculated charge densities. The differences are stored in units of electrons/Angstrom^3 for each supercell and for up-sampling factors of 1, 2, and 4. Once the JSON files are in place, the plot from the paper can be generated using the plot.py script.
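
    The comparison metric reported in validate_sc.json can be reproduced in spirit with a few lines of numpy. The sketch below is not the validate_sc.py script itself; it assumes the re-gridded and reference densities have already been loaded onto grids of the same shape and are already expressed in electrons/Angstrom^3.

    # Minimal sketch of the comparison metric (not the validate_sc.py script itself).
    import numpy as np

    def mean_abs_density_diff(regridded, reference):
        """Mean absolute difference between two charge-density grids of identical shape,
        assuming both inputs are already in electrons/Angstrom^3."""
        regridded = np.asarray(regridded)
        reference = np.asarray(reference)
        return float(np.mean(np.abs(regridded - reference)))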

  18. Dataset for manuscript: Phylogenetic and genomic insights into the evolution...

    • figshare.com
    zip
    Updated Jan 8, 2025
    Cite
    Puguang Zhao; Lingyun Chen (2025). Dataset for manuscript: Phylogenetic and genomic insights into the evolution of terpenoid biosynthesis genes in diverse plant lineages [Dataset]. http://doi.org/10.6084/m9.figshare.27187977.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 8, 2025
    Dataset provided by
    figshare
    Authors
    Puguang Zhao; Lingyun Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset for the manuscript entitled "Phylogenetic and genomic insights into the evolution of terpenoid biosynthesis genes in diverse plant lineages".

    1. "Species name.xlsx": This Excel file includes species name abbreviations and their full names.

    2. "Phylogeny" folder (Supplemental Figures S2–S25): Files related to the phylogenetic analyses for genes, including phylogenetic trees and corresponding amino acid sequences.
    2.1 Folder content:
    • Phylogenetic tree files (.raxml_bs.tre): the phylogenetic trees of each gene, generated using RAxML.
    • Amino acid sequence files (.fa): the amino acid sequences used to construct the phylogenetic trees.
    2.2 Constructing the phylogenetic trees: in a Linux environment, run
    python2 fasta_to_tree.py ./input_dir/ 4 aa y # place the amino acid sequence files (*.fa) in the 'input_dir' directory
    The script 'fasta_to_tree.py' is available at https://bitbucket.org/yanglab/phylogenomic_dataset_construction/src/master/scripts/

    3. "Expression" folder (Figures 5A and 5B; Supplemental Figures S33 and S34): Used for analyzing gene expression levels, including extracting, processing, and visualizing expression values. In this study, gene expression levels are represented as Transcripts Per Million (TPM).
    3.1 Folder content:
    • '1total-TPM' and '2total-TPM': RNA-seq reads for each species were mapped to the CDSs of each species using Salmon v1.3.0. The sequence IDs and TPM values in the 'quant.sf' output files of each species were extracted and combined into 'total_TPM'. Because 'total_TPM' exceeded Excel's file-size limit, it was split into '1total-TPM' and '2total-TPM'. Both files have two columns: ID (a unique identifier for each gene) and TPM (the normalized expression value for the corresponding gene). These files were used to extract the expression levels of target genes.
    • 'input/' directory: Contains the gene ID files for which expression levels need to be extracted. The gene IDs are derived from the amino acid sequence files in the "Phylogeny" folder; from these sequence files, the gene IDs for the target species are extracted. For example, in a Linux environment: grep '>' DXR-MEP.fa > DXR.xlsx # 'DXR-MEP.fa' is the input file from the "Phylogeny" folder; 'DXR.xlsx' is the output file containing the extracted gene IDs. After extracting the gene IDs, add a column header named "ID" to the output file (DXR.xlsx).
    • Python scripts (.py): These three scripts are executed in Visual Studio Code, with Python 3.10.4 as the runtime environment.
    - 'gene_expression-average.py': Extracts the gene expression levels from '1total-TPM' and '2total-TPM' for the gene IDs in the 'input' directory and calculates the average expression level of terpenoid biosynthesis genes for each species; the results are stored in the summary file 'average-expression.xlsx'. The processed data are used for downstream visualization, such as heatmaps (Figure 5A) and raincloud plots (Figure 5B).
    - 'gene_expression-3highest-average.py': Extracts the expression levels of the top three highest-expressed terpenoid biosynthesis genes for each species and calculates the average of the top three values; the results are stored in the summary file '3highest-average-expression.xlsx'. The processed data are used for downstream visualization, such as heatmaps (Supplemental Figure S33A) and raincloud plots (Supplemental Figure S33B).
    - 'gene_expression-sum.py': Extracts the gene expression levels from '1total-TPM' and '2total-TPM' for the gene IDs in the 'input' directory and calculates the sum of expression levels of terpenoid biosynthesis genes for each species; the results are stored in the summary file 'sum-expression.xlsx'. The processed data are used for downstream visualization, such as heatmaps (Supplemental Figure S34A) and raincloud plots (Supplemental Figure S34B).
    • CSV files (.csv): Used as input for generating visualizations (heatmaps and raincloud plots). All values are log2-transformed for better visualization and analysis.
    - 'average-expression-log.csv' (Figures 5A and 5B): Generated by converting 'average-expression.xlsx' (the output of 'gene_expression-average.py') to CSV and applying a log2 transformation; contains the average expression levels of genes.
    - '3highest-average-expression-log.csv' (Supplemental Figures S33A and S33B): Generated by converting the output of 'gene_expression-3highest-average.py' to CSV and applying a log2 transformation; contains the top three highest average expression levels for each gene family.
    - 'sum-expression-log.csv' (Supplemental Figures S34A and S34B): Generated by converting 'sum-expression.xlsx' (the output of 'gene_expression-sum.py') to CSV and applying a log2 transformation; contains the summed expression levels for genes across the selected species.
    • 'heatmap.r': Generates heatmaps for visualizing gene expression levels. The raincloud plots are generated with tvBOT (https://www.chiplot.online/).
    3.2 Workflow:
    3.2.1 Preparation: Place the reference files ('1total-TPM' and '2total-TPM') and the gene ID files ('input/') in the respective directories.
    3.2.2 Run the Python scripts according to the required analysis: 'gene_expression-sum.py' for the sum of expression levels, 'gene_expression-average.py' for average expression levels, or 'gene_expression-3highest-average.py' for the top three highest averages. Each script generates a summary file in .xlsx format.
    3.2.3 Post-process the outputs: Convert each summary file (.xlsx) to .csv format and apply a log2 transformation to all expression values.
    3.2.4 Generate heatmaps and raincloud plots: Use the R script 'heatmap.r' to create heatmaps from the log2-transformed .csv files (Figure 5A; Supplemental Figures S33A and S34A), and use tvBOT (https://www.chiplot.online/) to create raincloud plots from the same files (Figure 5B; Supplemental Figures S33B and S34B).

    4. "KaKs" folder (Figure 5C): Used to process Ka/Ks data for individual genes and to generate boxplots of the Ka/Ks distribution.
    4.1 Folder content:
    • Python scripts (.py):
    - 'cdhit.py': Executed in a Linux environment (python3 cdhit.py). Removes redundant sequences using the CD-HIT tool. The input is a FASTA file containing the sequences of terpenoid biosynthesis genes; these genes are identified using the gene IDs provided in the "Expression" folder, which are used to retrieve the corresponding sequences and organize them into the input FASTA file. The resulting non-redundant sequences are used for downstream Ka/Ks calculations.
    - 'gene_pair.py': Executed in a Linux environment (python3 gene_pair.py). Generates all possible gene pairs (including reverse pairs, excluding self-pairs), such as 'DXR_pair.id', from a list of gene IDs provided in an input file. The input gene IDs are obtained from the non-redundant sequences generated by 'cdhit.py'.
    - 'boxplot.py': Executed in Visual Studio Code, with Python 3.10.4 as the runtime environment. Processes the data in the 'input/' directory and creates a boxplot of the Ka/Ks distribution (Figure 5C).
    • 'DXR_pair.id': An example gene-pair file, specifically for the DXR gene family in the target species, generated with 'gene_pair.py'. It is a text file listing all possible gene pairs, one pair per line: Gene1 Gene2, Gene1 Gene3, Gene2 Gene1, ...
    • 'input/' directory: Contains Excel files with Ka/Ks ratios for each species. The Ka/Ks ratios are calculated for gene pairs using ParaAT and KaKs_Calculator, and the results are organized into one Excel file per gene family with two columns, 'Species' and 'Ka/Ks', stored in the 'input/' directory.
    4.2 Workflow:
    4.2.1 Preparation: Use 'cdhit.py' to process the gene sequences (the same genes as in the "Expression" folder) and generate non-redundant sequences. Extract gene IDs from the non-redundant sequences and use 'gene_pair.py' to generate gene pairs. Calculate Ka/Ks values for the gene pairs using ParaAT and KaKs_Calculator, then organize the output into .xlsx files with two columns: Species (the species name) and Ka/Ks (the calculated Ka/Ks ratio).
    4.2.2 Run the script 'boxplot.py'.
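
    Step 3.2.3 above (CSV conversion plus log2 transformation) can be reproduced with a short pandas snippet. The sketch below is illustrative only: it assumes the summary workbook has one non-numeric identifier column with the remaining columns holding expression values, and the +1 offset (used to avoid log2 of zero) may differ from the manuscript's convention.

    import numpy as np
    import pandas as pd

    # Convert a summary workbook to CSV and log2-transform the expression values.
    df = pd.read_excel("average-expression.xlsx")        # output of gene_expression-average.py
    value_cols = df.select_dtypes("number").columns      # numeric expression columns only
    df[value_cols] = np.log2(df[value_cols] + 1)         # +1 guards against log2(0)
    df.to_csv("average-expression-log.csv", index=False)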

  19. ONE DATA Data Science Workflows

    • zenodo.org
    json
    Updated Sep 17, 2021
    Cite
    Lorenz Wendlinger; Emanuel Berndl; Michael Granitzer (2021). ONE DATA Data Science Workflows [Dataset]. http://doi.org/10.5281/zenodo.4633704
    Explore at:
    Available download formats: json
    Dataset updated
    Sep 17, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lorenz Wendlinger; Emanuel Berndl; Michael Granitzer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The ONE DATA data science workflow dataset ODDS-full comprises 815 unique workflows in temporally ordered versions.
    A version of a workflow describes its evolution over time: whenever a workflow is altered meaningfully, a new version of that workflow is persisted.
    Overall, 16,035 versions are available.

    The ODDS-full workflows represent machine learning workflows expressed as node-heterogeneous DAGs with 156 different node types.
    These node types represent various kinds of processing steps of a general machine learning workflow and are grouped into 5 categories, which are listed below.

    • Load: Processors for loading or generating data (e.g. via a random number generator).
    • Save: Processors for persisting data (in various data formats, via external connections, or as a contained result within the ONE DATA platform) or for providing data to other places as a service.
    • Transformation: Processors for altering and adapting data. This includes database-like operations such as renaming columns or joining tables, as well as fully fledged dataset queries.
    • Quantitative Methods: Various aggregation or correlation analyses, bucketing, and simple forecasting.
    • Advanced Methods: Advanced machine learning algorithms such as BNN or Linear Regression. Also includes special meta processors that, for example, allow the execution of external workflows within the original workflow.

    Any metadata beyond the structure and node types of a workflow has been removed for anonymization purposes.

    ODDS, a filtered variant that enforces weak connectedness and only contains workflows with at least 5 different versions and 5 nodes, is available as the default version for supervised and unsupervised learning.

    Workflows are served as JSON node-link graphs via networkx.

    They can be loaded into python as follows:

    import pandas as pd
    import networkx as nx
    import json
    
    # Parse the JSON node-link graphs into a pandas Series of networkx graphs.
    with open('ODDS.json', 'r') as f:
      graphs = pd.Series(list(map(nx.node_link_graph, json.load(f)['graphs'])))
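
    Once loaded, the workflows behave like ordinary networkx graphs, so basic structural statistics are straightforward. In the follow-up sketch below, the node-attribute key holding the processor type ('type') is an assumption and may differ in the actual JSON.

    # Follow-up sketch: basic structural statistics over the loaded workflow graphs.
    sizes = graphs.apply(lambda g: g.number_of_nodes())
    print("workflows:", len(graphs))
    print("median nodes per workflow:", sizes.median())

    # The attribute name 'type' is assumed here, not taken from the dataset documentation.
    node_types = {d.get("type") for g in graphs for _, d in g.nodes(data=True)}
    print("distinct node types observed:", len(node_types))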

  20. Timac Fuel Distribution & Sales Dataset –

    • kaggle.com
    Updated May 31, 2025
    Cite
    Fatolu Peter (2025). Timac Fuel Distribution & Sales Dataset – [Dataset]. https://www.kaggle.com/datasets/olagokeblissman/timac-fuel-distribution-and-sales-dataset/suggestions
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 31, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Fatolu Peter
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📝 Dataset Overview: This dataset represents real-world, enhanced transactional data from Timac Global Concept, one of Nigeria’s prominent players in fuel and petroleum distribution. It includes comprehensive sales records across multiple stations and product categories (AGO, PMS, Diesel, Lubricants, LPG), along with revenue and shift-based operational tracking.

    The dataset is ideal for analysts, BI professionals, and data science students aiming to explore fuel economy trends, pricing dynamics, and operational analytics.

    🔍 Dataset Features:
    • Date: Transaction date
    • Station_Name: Name of the fuel station
    • AGO_Sales (L): Automotive Gas Oil sold in liters
    • PMS_Sales (L): Premium Motor Spirit sold in liters
    • Lubricant_Sales (L): Lubricant sales in liters
    • Diesel_Sales (L): Diesel sold in liters
    • LPG_Sales (kg): Liquefied Petroleum Gas sold in kilograms
    • Total_Revenue (₦): Total revenue generated in Nigerian Naira
    • AGO_Price: Price per liter of AGO
    • PMS_Price: Price per liter of PMS
    • Lubricant_Price: Unit price of lubricants
    • Diesel_Price: Price per liter of diesel
    • LPG_Price: Price per kg of LPG
    • Product_Category: Fuel product type
    • Shift: Work shift (e.g., Morning, Night)
    • Supervisor: Supervisor in charge during the shift
    • Weekday: Day of the week for each transaction

    🎯 Use Cases: Build Power BI dashboards to track fuel sales trends and shifts

    Perform revenue forecasting using time series models

    Analyze price dynamics vs sales volume

    Visualize station-wise performance and weekday sales patterns

    Conduct operational audits per supervisor or shift

    🧰 Best Tools for Analysis: Power BI, Tableau

    Python (Pandas, Matplotlib, Plotly)

    Excel for pivot tables and summaries

    SQL for fuel category insights
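
    As a quick start in Python (one of the tools listed above), the sketch below aggregates revenue by station and weekday. The CSV file name is a placeholder; the column names follow the feature list above.

    import pandas as pd

    # Quick-start sketch: total revenue by station and weekday.
    df = pd.read_csv("timac_fuel_sales.csv", parse_dates=["Date"])
    summary = (
        df.groupby(["Station_Name", "Weekday"])["Total_Revenue (₦)"]
          .sum()
          .sort_values(ascending=False)
    )
    print(summary.head(10))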

    👤 Created By: Fatolu Peter (Emperor Analytics) Data analyst focused on real-life data transformation in Nigeria’s petroleum, healthcare, and retail sectors. This is Project 11 in my growing portfolio of end-to-end analytics challenges.

    ✅ LinkedIn Post: ⛽ New Dataset Alert – Fuel Economy & Sales Data Now on Kaggle! 📊 Timac Fuel Distribution & Revenue Dataset (Nigeria – 500 Records) 🔗 Explore the data here

    Looking to practice business analytics, revenue forecasting, or operational dashboards?

    This dataset contains:

    Daily sales of AGO, PMS, Diesel, LPG & Lubricants

    Revenue breakdowns by station

    Shift & supervisor tracking

    Fuel prices across product categories

    You can use this to: ✅ Build Power BI sales dashboards ✅ Create fuel trend visualizations ✅ Analyze shift-level profitability ✅ Forecast revenue using Python or Excel

    Let’s put real Nigerian data to real analytical work. Tag me when you build with it—I’d love to celebrate your work!

    #FuelAnalytics #KaggleDatasets #PowerBI #PetroleumIndustry #NigeriaData #RevenueForecasting #EmperorAnalytics #FatoluPeter #Project11 #TimacGlobal #RealWorldData

Data from: INTEGRATE - Inverse Network Transformations for Efficient Generation of Robust Airfoil and Turbine Enhancements

The integration of INN-Airfoil into WISDEM allows airfoils to be designed together with the blades so that they meet the design constraints on cost of energy, annual energy production, and capital costs. Through preliminary studies, researchers have shown that the coupled INN-Airfoil + WISDEM approach reduces the cost of energy by around 1% compared to the conventional design approach. This page will serve as a place to easily access all the publications from this work and the repositories for the software developed and released through this project.
