48 datasets found
  1. Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 8, 2022
    Cite
    Köhler, Juliane (2022). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6957841
    Explore at:
    Dataset updated
    Aug 8, 2022
    Authors
    Köhler, Juliane
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

    Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.

    ger_train.csv – The German training set as CSV file.

    ger_validation.csv – The German validation set as CSV file.

    en_test.csv – The English test set as CSV file.

    en_train.csv – The English training set as CSV file.

    en_validation.csv – The English validation set as CSV file.

    splitting.py – The python code for splitting a dataset into train, test and validation set.

    DataSetTrans_de.csv – The final German dataset as a CSV file.

    DataSetTrans_en.csv – The final English dataset as a CSV file.

    translation.py – The python code for translating the cleaned dataset.
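
    For orientation, a three-way split like the one splitting.py performs can be sketched with pandas and scikit-learn; the 80/10/10 ratio and the output file handling below are illustrative assumptions, not taken from this record:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv('Cleaned_Dataset.csv')
    # First carve out 20%, then halve it into validation and test (assumed ratios).
    train, rest = train_test_split(df, test_size=0.2, random_state=42)
    validation, test = train_test_split(rest, test_size=0.5, random_state=42)
    train.to_csv('ger_train.csv', index=False)  # output file name as in this record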

  2. Surrogate flood model comparison - Datasets and python code

    • figshare.unimelb.edu.au
    bin
    Updated Jan 19, 2024
    Cite
    Niels Fraehr (2024). Surrogate flood model comparison - Datasets and python code [Dataset]. http://doi.org/10.26188/24312658.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Jan 19, 2024
    Dataset provided by
    The University of Melbourne
    Authors
    Niels Fraehr
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data used for publication in "Assessment of surrogate models for flood inundation: The physics-guided LSG model vs. state-of-the-art machine learning models". Five surrogate models for flood inundation are used to emulate the results of high-resolution hydrodynamic models. The surrogate models are compared on accuracy and computational speed for three distinct case studies: Carlisle (United Kingdom), the Chowilla floodplain (Australia), and the Burnett River (Australia).

    The dataset is structured in 5 files - "Carlisle", "Chowilla", "BurnettRV", "Comparison_results", and "Python_data". As a minimum, the "Python_data" file and one of "Carlisle", "Chowilla", or "BurnettRV" are needed to run the models. We suggest using the "Carlisle" case study for initial testing, given its small size and small data requirement.

    "Carlisle", "Chowilla", and "BurnettRV" files

    These files contain hydrodynamic modelling data for training and validation for each individual case study, as well as specific Python scripts for training and running the surrogate models in each case study. There are only small differences between each folder, depending on the hydrodynamic model being emulated and the input boundary conditions (input features).

    Each case study file has the following folders:

    Geometry_data: DEM files, .npz files containing the high-fidelity model's grid (XYZ-coordinates) and areas (the same data is available for the low-fidelity model used in the LSG model), and .shp files indicating the location of boundaries and main flow paths (mainly used in the LSTM-SRR model).

    XXX_modeldata: Folder to store trained model data for each XXX surrogate model. For example, GP_EOF_modeldata contains files used to store the trained GP-EOF model.

    HD_model_data: High-fidelity (and low-fidelity) simulation results for all flood events of that case study. This folder also contains all boundary input conditions.

    HF_EOF_analysis: Stores data used in the EOF analysis. EOF analysis is applied for the LSG, GP-EOF, and LSTM-EOF surrogate models.

    Results_data: Stores the results of the surrogate model evaluations.

    Train_test_split_data: The train-test-validation data split is the same for all surrogate models. The specific split for each cross-validation fold is stored in this folder.

    And Python files:

    YYY_event_summary, YYY_Extrap_event_summary: Files containing an overview of all events, and which events are connected between the low- and high-fidelity models, for each YYY case study.

    EOF_analysis_HFdata_preprocessing, EOF_analysis_HFdata: Preprocessing before EOF analysis and the EOF analysis of the high-fidelity data. These are used for the LSG, GP-EOF, and LSTM-EOF surrogate models.

    Evaluation, Evaluation_extrap: Scripts for evaluating the surrogate models for that case study and saving the results for each cross-validation fold.

    train_test_split: Script for splitting the flood datasets for each cross-validation fold, so all surrogate models train on the same data.

    XXX_training: Script for training each XXX surrogate model.

    XXX_preprocessing: Some surrogate models rely on information that needs to be generated before training; this is done using these scripts.

    "Comparison_results" file

    Files used for comparing the surrogate models and generating the figures in the paper "Assessment of surrogate models for flood inundation: The physics-guided LSG model vs. state-of-the-art machine learning models". Figures are also included.

    "Python_data" file

    Folder containing Python scripts with utility functions for setting up, training, running, and evaluating the surrogate models. This folder also contains a python_environment.yml file with all Python package versions and dependencies, as well as two sub-folders:

    LSG_mods_and_func: Python scripts for using the LSG model. Some of these scripts are also utilized when working with the other surrogate models.

    SRR_method_master_Zhou2021: Scripts obtained from https://github.com/yuerongz/SRR-method. Small edits have been made for speed and for use in this study.

  3. DustNet - structured data and Python code to reproduce the model,...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1 more
    Updated Jul 7, 2024
    Cite
    Nowak, T. E.; Augousti, Andy T.; Simmons, Benno I.; Siegert, Stefan (2024). DustNet - structured data and Python code to reproduce the model, statistical analysis and figures [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10631953
    Explore at:
    Dataset updated
    Jul 7, 2024
    Dataset provided by
    University of Exeter
    Kingston University
    Authors
    Nowak, T. E.; Augousti, Andy T.; Simmons, Benno I.; Siegert, Stefan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and Python code used for AOD prediction with the DustNet model - a Machine Learning/AI-based forecasting model.

    Model input data and code

    Processed MODIS AOD data (from Aqua and Terra) and selected ERA5 variables*, ready to reproduce the DustNet model results or for similar forecasting with Machine Learning. These long-term daily timeseries (2003-2022) are provided as n-dimensional NumPy arrays. The Python code to handle the data and run the DustNet model** is included as the Jupyter Notebook ‘DustNet_model_code.ipynb’. A subfolder with normalised data, split into training/validation/testing sets, is also provided, together with Python code for two additional ML-based models** used for comparison (U-NET and Conv2D). Pre-trained models are also archived here as TensorFlow files.

    Model output data and code

    This dataset was constructed by running ‘DustNet_model_code.ipynb’ (see above). It consists of 1095 days of forecast AOD data (2020-2022) from CAMS, the DustNet model, a naïve prediction (persistence), and gridded climatology. The ground truth raw AOD data from MODIS is provided for comparison and statistical analysis of the predictions. It is intended for quick reproduction of the figures and statistical analysis presented in the paper introducing DustNet.

    *datasets are NumPy arrays (v1.23) created in Python v3.8.18.

    **all ML models were created with Keras in Python v3.10.10.
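
    For a quick look at the arrays, NumPy's load function suffices; the file name below is a placeholder, not one from this record:

    import numpy as np

    # Placeholder file name; substitute one of the .npy arrays from this record.
    aod = np.load('modis_aod_daily.npy')
    print(aod.shape, aod.dtype)  # long-term daily timeseries, e.g. (time, lat, lon)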

  4. MC-LSTM papers, model runs

    • search.dataone.org
    • hydroshare.org
    Updated Dec 30, 2023
    Cite
    Jonathan Martin Frame (2023). MC-LSTM papers, model runs [Dataset]. http://doi.org/10.4211/hs.d750278db868447dbd252a8c5431affd
    Explore at:
    Dataset updated
    Dec 30, 2023
    Dataset provided by
    Hydroshare
    Authors
    Jonathan Martin Frame
    Time period covered
    Jan 1, 1989 - Jan 1, 2015
    Area covered
    Description

    Runs from two papers exploring the use of mass conserving LSTM. Model results used in the papers are 1) model_outputs_for_analysis_extreme_events_paper.tar.gz, and 2) model_outputs_for_analysis_mass_balance_paper.tar.gz.

    The models here are trained/calibrated on three different time periods. Standard Time Split (time split 1): the test period (1989-1999) is the same period used by previous studies, which allows us to confirm that the deep learning models (LSTM and MC-LSTM) trained for this project perform as expected relative to prior work. NWM Time Split (time split 2): the second test period (1995-2014) allows us to benchmark against the NWM-Rv2, which does not provide data prior to 1995. Return period split: the third test period (based on return periods) allows us to benchmark only on water years that contain streamflow events that are larger (per basin) than anything seen in the training data (<= 5-year return periods in training and > 5-year return periods in testing).

    Also included are an ensemble of model runs for LSTM, MC-LSTM for the "standard" training period and two forcing products. These files are provided in the format "

    IMPORTANT NOTE: This python environment should be used to extract and load the data: https://github.com/jmframe/mclstm_2021_extrapolate/blob/main/python_environment.yml, as the pickle files serialized the data with specific versions of python libraries. Specifically, the pickle serialization was done with xarray=0.16.1.
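
    Concretely, loading one of the serialized files inside that environment would look like this minimal sketch (the file name is a placeholder; the essential point is that xarray 0.16.1 must be installed when unpickling):

    import pickle

    # Run inside the environment from python_environment.yml, since the
    # objects were serialized with xarray=0.16.1.
    with open('model_outputs.p', 'rb') as f:  # placeholder file name
        results = pickle.load(f)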

    Code to interpret these runs can be found here: https://github.com/jmframe/mclstm_2021_extrapolate https://github.com/jmframe/mclstm_2021_mass_balance

    Papers are available here: https://hess.copernicus.org/preprints/hess-2021-423/

  5. Fruits Classification 🍇

    • kaggle.com
    zip
    Updated Apr 9, 2023
    Cite
    DeepNets (2023). Fruits Classification 🍇 [Dataset]. https://www.kaggle.com/datasets/utkarshsaxenadn/fruits-classification/suggestions
    Explore at:
    Available download formats: zip (88954615 bytes)
    Dataset updated
    Apr 9, 2023
    Authors
    DeepNets
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The fruit classification dataset is a collection of images of various fruits used for training and testing computer vision models. The dataset includes five different types of fruit:

    • Apples
    • Bananas
    • Grapes
    • Mangoes
    • Strawberries

    Each class contains 2000 images, resulting in a total of 10,000 images in the dataset.

    The images in the dataset are of various shapes, sizes, and colors, and have been captured under different lighting conditions. The dataset is useful for training and testing models that perform tasks such as object detection, image classification, and segmentation.

    The dataset can be used for various research projects, such as developing and testing new image classification algorithms, and for benchmarking existing algorithms. The dataset can also be used to train machine learning models that can be used in real-world applications, such as in the agricultural industry for fruit grading and sorting.

    Overall, the fruit classification dataset is a valuable resource for researchers and developers working in the field of computer vision, and its availability will help advance the development of new algorithms and technologies for image analysis and classification.

    Data Structure

    The data is split into three sets: training, validation, and testing. The training set is used to train the model, while the validation set is used to evaluate the model's performance during training and make adjustments as necessary. The testing set is used to evaluate the final performance of the model after training is complete.

    The dataset is split based on a ratio of 97% for training, 2% for validation, and 1% for testing. This means that the training set contains 9700 images (97% of the total), the validation set contains 200 images (2% of the total), and the testing set contains 100 images (1% of the total).

    Each class in the dataset is split into three sets based on the ratio. For example, for the "Apple" class, 97% (1940 images) are used for training, 2% (40 images) are used for validation, and 1% (20 images) are used for testing. This ensures that the distribution of classes is consistent across all three sets and that the model is trained on a representative sample of all classes.

    Overall, the split of the dataset into training, validation, and testing sets ensures that the model is robust and generalizes well to new, unseen data.

    Python Script

    The script provided creates train, validation, and test sets from a fruit image dataset by splitting the dataset into predetermined ratios, shuffling the images, and moving them to their respective directories.
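
    A script of that kind might look like the following sketch; the class names and 97/2/1 ratios come from the description above, while the directory layout and file handling are assumptions:

    import os
    import random
    import shutil

    random.seed(42)
    classes = ['Apples', 'Bananas', 'Grapes', 'Mangoes', 'Strawberries']
    for cls in classes:
        images = os.listdir(os.path.join('Fruits', cls))  # assumed source layout
        random.shuffle(images)
        n = len(images)
        splits = {'train': images[:int(0.97 * n)],
                  'valid': images[int(0.97 * n):int(0.99 * n)],
                  'test': images[int(0.99 * n):]}
        for split, files in splits.items():
            dest = os.path.join(split, cls)
            os.makedirs(dest, exist_ok=True)
            for name in files:
                shutil.move(os.path.join('Fruits', cls, name), os.path.join(dest, name))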

  6. Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the...

    • researchdata.tuwien.at
    html, pdf, zip
    Updated Mar 19, 2025
    Cite
    Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi (2025). Decoding Wayfinding: Analyzing Wayfinding Processes in the Outdoor Environment [Dataset]. http://doi.org/10.48436/m2ha4-t1v92
    Explore at:
    Available download formats: html, zip, pdf
    Dataset updated
    Mar 19, 2025
    Dataset provided by
    TU Wien
    Authors
    Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    How To Cite?

    Alinaghi, N., Giannopoulos, I., Kattenbeck, M., & Raubal, M. (2025). Decoding wayfinding: analyzing wayfinding processes in the outdoor environment. International Journal of Geographical Information Science, 1–31. https://doi.org/10.1080/13658816.2025.2473599

    Link to the paper: https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2473599

    Folder Structure

    The folder named “submission” contains the following:

    1. “pythonProject”: This folder contains all the Python files and subfolders needed for analysis.
    2. ijgis.yml: This file lists all the Python libraries and dependencies required to run the code.

    Setting Up the Environment

    1. Use the ijgis.yml file to create a Python project and environment. Ensure you activate the environment before running the code.
    2. The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below.

    Subfolders

    1. Data_4_IJGIS

    • This folder contains the data used for the results reported in the paper.
    • Note: The data analysis explained in the paper begins with the synchronization and cleaning of the recorded raw data. The published data is already synchronized and cleaned. Both the cleaned files and the merged files with extracted features are given in this directory. If you want to perform the segmentation and feature extraction yourself, run the respective Python files; if not, you can use the “merged_…csv” files as input for the training.

    2. results_[DateTime] (e.g., results_20240906_15_00_13)

    • This folder will be generated when you run the code and will store the output of each step.
    • The current folder contains results created during code debugging for the submission.
    • When you run the code, a new folder with fresh results will be generated.

    Python Files

    1. helper_functions.py

    • Contains reusable functions used throughout the analysis.
    • Each function includes a description of its purpose and the input parameters required.

    2. create_sanity_plots.py

    • Generates scatter plots like those in Figure 3 of the paper.
    • Although the code has been run for all 309 trials, it can be used to check the sample data provided.
    • Output: A .png file for each column of the raw gaze and IMU recordings, color-coded with logged events.
    • Usage: Run this file to create visualizations similar to Figure 3.

    3. overlapping_sliding_window_loop.py

    • Implements overlapping sliding window segmentation and generates plots like those in Figure 4 (a generic sketch of this kind of segmentation follows the output list below).
    • Output:
      • Two new subfolders, “Gaze” and “IMU”, will be added to the Data_4_IJGIS folder.
      • Segmented files (default: 2–10 seconds with a 1-second step size) will be saved as .csv files.
      • A visualization of the segments, similar to Figure 4, will be automatically generated.
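
    As a generic illustration of this step (not the authors' implementation), overlapping sliding-window segmentation can be sketched as follows; the sampling rate, file names, and output layout are assumptions:

    import os
    import pandas as pd

    def sliding_windows(df, fs, window_s, step_s=1.0):
        # Yield overlapping segments of window_s seconds, advancing by step_s seconds.
        win, step = int(window_s * fs), int(step_s * fs)
        for start in range(0, len(df) - win + 1, step):
            yield df.iloc[start:start + win]

    os.makedirs('Gaze', exist_ok=True)
    gaze = pd.read_csv('trial_001_gaze.csv')  # placeholder file name
    for i, segment in enumerate(sliding_windows(gaze, fs=100, window_s=2.0)):
        segment.to_csv(os.path.join('Gaze', f'segment_{i}.csv'), index=False)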

    4. gaze_features.py & imu_features.py (Note: there has been an update to the IDT function implementation in the gaze_features.py on 19.03.2025.)

    • These files compute features as explained in Tables 1 and 2 of the paper, respectively.
    • They process the segmented recordings generated by the overlapping_sliding_window_loop.py.
    • Usage: To see how the features are calculated, run these files after the sliding-window segmentation; they compute the features from the segmented data.

    5. training_prediction.py

    • This file contains the main machine learning analysis of the paper: all the code for training the model, evaluating it, and using it for inference on the “monitoring part”. It covers the following steps:
    a. Data Preparation (corresponding to Section 5.1.1 of the paper)
    • Prepares the data according to the research question (RQ) described in the paper. Since this data was collected with several RQs in mind, we remove parts of the data that are not related to the RQ of this paper.
    • A function named plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out, but if you want to see visually what has been changed compared to the original data, you can comment out this line.
    b. Training/Validation/Test Split
    • Splits the data for machine learning experiments (an explanation can be found in Section 5.1.1. Preparation of data for training and inference of the paper).
    • Make sure that you follow the instructions in the comments to the code exactly.
    • Output: The split data is saved as .csv files in the results folder.
    c. Machine and Deep Learning Experiments

    This part contains three main code blocks:


    • MLP Network (Commented Out): This code was used for classification with the MLP network, and the results shown in Table 3 are from this code. If you wish to use this model, please comment out the following blocks accordingly.
    • XGBoost without Hyperparameter Tuning: If you want to run the code but do not want to spend time on the full training with hyperparameter tuning (as was done for the paper), just uncomment this part. This will give you a simple, untuned model with which you can achieve at least some results.
    • XGBoost with Hyperparameter Tuning: If you want to train the model the way we trained it for the analysis reported in the paper, use this block (the plots in Figure 7 are from this block). We ran this block with different feature sets and different segmentation files and created a simple bar chart from the saved results, shown in Figure 6.

    Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2. Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.

    d. Inference (Monitoring Part)
    • Final inference is performed using the monitoring data. This step produces a .csv file containing inferred labels.
    • Figure 8 in the paper is generated using this part of the code.

    6. sequence_analysis.py

    • Performs analysis on the inferred data, producing Figures 9 and 10 from the paper.
    • This file reads the inferred data from the previous step and performs sequence analysis as described in Sections 5.2.1 and 5.2.2.

    Licenses

    The data is licensed under CC-BY, the code is licensed under MIT.

  7. Data from: JSON Dataset of Simulated Building Heat Control for System of...

    • gimi9.com
    • researchdata.se
    Cite
    JSON Dataset of Simulated Building Heat Control for System of Systems Interoperability [Dataset]. https://gimi9.com/dataset/eu_https-doi-org-10-5878-1tv7-9x76/
    Explore at:
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Interoperability in systems-of-systems is a difficult problem due to the abundance of data standards and formats. Current approaches to interoperability rely on hand-made adapters or methods using ontological metadata. This dataset was created to facilitate research on data-driven interoperability solutions. The data comes from a simulation of a building heating system and the messages sent within control systems-of-systems. For more information, see the attached data documentation.

    The data comes in two semicolon-separated (;) csv files, training.csv and test.csv. The train/test split is not random; the training data comes from the first 80% of simulated timesteps, and the test data is the last 20%. There is no specific validation dataset; the validation data should instead be randomly selected from the training data.

    The simulation runs for as many time steps as there are outside temperature values available. The original SMHI data only samples once every hour, which we linearly interpolate to get one temperature sample every ten seconds. The data saved at each time step consists of 34 JSON messages (four per room and two temperature readings from the outside), 9 temperature values (one per room and outside), 8 setpoint values, and 8 actuator outputs. The data associated with each of those 34 JSON messages is stored as a single row in the tables. This means that much data is duplicated, a choice made to make the data easier to use.

    The simulation data is not meant to be opened and analyzed in spreadsheet software; it is meant for training machine learning models. It is recommended to open the data with the pandas library for Python, available at https://pypi.org/project/pandas/.
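
    For example, the files can be read with pandas, keeping the semicolon separator in mind, and a validation set sampled from the training data as suggested:

    import pandas as pd

    # Both files are semicolon-separated, per the documentation above.
    train = pd.read_csv('training.csv', sep=';')
    test = pd.read_csv('test.csv', sep=';')

    # No separate validation file is provided; sample one from the training
    # data (the 10% fraction here is an assumption).
    validation = train.sample(frac=0.1, random_state=0)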

  8. Rescaled CIFAR-10 dataset

    • zenodo.org
    Updated Jun 27, 2025
    Cite
    Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg (2025). Rescaled CIFAR-10 dataset [Dataset]. http://doi.org/10.5281/zenodo.15188748
    Explore at:
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg
    Description

    Motivation

    The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

    The Rescaled CIFAR-10 dataset was introduced in the paper:

    [1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

    with a pre-print available at arXiv:

    [2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

    Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:

    [3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2

    and is therefore significantly more challenging.

    Access and rights

    The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:

    [4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.

    and also for this new rescaled version, using the reference [1] above.

    The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

    The dataset

    The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. In order for all test images to have the same resolution, mirror extension is used to extend the images to size 64×64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

    There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].

    The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.

    The h5 files containing the dataset

    The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

    cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5

    Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being an integer in the range [-4, 4]:

    cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5

    These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].

    Instructions for loading the data set

    The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
    ('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

    The training dataset can be loaded in Python as:

    import h5py
    import numpy as np

    with h5py.File('cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5', 'r') as f:
        x_train = np.array(f["/x_train"], dtype=np.float32)
        x_val = np.array(f["/x_val"], dtype=np.float32)
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_train = np.array(f["/y_train"], dtype=np.int32)
        y_val = np.array(f["/y_val"], dtype=np.int32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    We also need to permute the data, since Pytorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

    x_train = np.transpose(x_train, (0, 3, 1, 2))
    x_val = np.transpose(x_val, (0, 3, 1, 2))
    x_test = np.transpose(x_test, (0, 3, 1, 2))

    The test datasets can be loaded in Python as:

    with h5py.File('cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5', 'r') as f:  # any of the test files listed above
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    The test datasets can be loaded in Matlab as:

    x_test = h5read('cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5', '/x_test');

    The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.

  9. Tour Recommendation Model

    • test.researchdata.tuwien.at
    bin, png +1
    Updated May 14, 2025
    Cite
    Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar (2025). Tour Recommendation Model [Dataset]. http://doi.org/10.70124/akpf6-8p175
    Explore at:
    Available download formats: text/markdown, png, bin
    Dataset updated
    May 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Dataset Description for Tour Recommendation Model

    Context and Methodology:

    • Research Domain/Project:
      This dataset is part of the Tour Recommendation System project, which focuses on predicting user preferences and ratings for various tourist places and events. It belongs to the field of Machine Learning, specifically applied to Recommender Systems and Predictive Analytics.

    • Purpose:
      The dataset serves as the training and evaluation data for a Decision Tree Regressor model, which predicts ratings (from 1-5) for different tourist destinations based on user preferences. The model can be used to recommend places or events to users based on their predicted ratings.

    • Creation Methodology:
      The dataset was originally collected from a tourism platform where users rated various tourist places and events. The data was preprocessed to remove missing or invalid entries (such as #NAME? in rating columns). It was then split into subsets for training, validation, and testing the model.

    Technical Details:

    • Structure of the Dataset:
      The dataset is stored as a CSV file (user_ratings_dataset.csv) and contains the following columns:

      • place_or_event_id: Unique identifier for each tourist place or event.

      • rating: Rating given by the user, ranging from 1 to 5.

      The data is split into three subsets:

      • Training Set: 80% of the dataset used to train the model.

      • Validation Set: A small portion used for hyperparameter tuning.

      • Test Set: 20% used to evaluate model performance.

    • Folder and File Naming Conventions:
      The dataset files are stored in the following structure:

      • user_ratings_dataset.csv: The original dataset file containing user ratings.

      • tour_recommendation_model.pkl: The saved model after training.

      • actual_vs_predicted_chart.png: A chart comparing actual and predicted ratings.

    • Software Requirements:
      To open and work with this dataset, the following software and libraries are required:

      • Python 3.x

      • Pandas for data manipulation

      • Scikit-learn for training and evaluating machine learning models

      • Matplotlib for chart generation

      • Joblib for saving and loading the trained model

      The dataset can be opened and processed using any Python environment that supports these libraries.
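
      As an illustration of how such a model might be trained and saved with these libraries, here is a minimal sketch; it assumes the two columns listed above are numeric, and the hyperparameters and split details are not taken from this record:

      import joblib
      import pandas as pd
      from sklearn.model_selection import train_test_split
      from sklearn.tree import DecisionTreeRegressor

      df = pd.read_csv('user_ratings_dataset.csv')
      X, y = df[['place_or_event_id']], df['rating']
      # 80/20 train/test split as described above; the small validation
      # portion for hyperparameter tuning is omitted in this sketch.
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

      model = DecisionTreeRegressor().fit(X_train, y_train)
      joblib.dump(model, 'tour_recommendation_model.pkl')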

    • Additional Resources:

      • The model training code, README file, and performance chart are available in the project repository.

      • For detailed explanation and code, please refer to the GitHub repository (or any other relevant link for the code).

    Further Details:

    • Dataset Reusability:
      The dataset is structured for easy use in training machine learning models for recommendation systems. Researchers and practitioners can utilize it to:

      • Train other types of models (e.g., regression, classification).

      • Experiment with different features or add more metadata to enrich the dataset.

    • Data Integrity:
      The dataset has been cleaned and preprocessed to remove invalid values (such as #NAME? or missing ratings). However, users should ensure they understand the structure and the preprocessing steps taken before reusing it.

    • Licensing:
      The dataset is provided under the CC BY 4.0 license, which allows free usage, distribution, and modification, provided that proper attribution is given.

  10. asl_hand_keypoints

    • kaggle.com
    zip
    Updated Jun 28, 2023
    Cite
    Rithvik Sreenish (2023). asl_hand_keypoints [Dataset]. https://www.kaggle.com/datasets/rithviksreenish/asl-hand-keypoints/code
    Explore at:
    Available download formats: zip (1273547 bytes)
    Dataset updated
    Jun 28, 2023
    Authors
    Rithvik Sreenish
    Description

    It is an easy-to-use, small dataset for medium-sized American Sign Language (ASL) projects, and it works best with an LSTM layer. The dataset contains images and videos of people signing ASL and is divided into two parts: images and videos.

    Purpose: This dataset can be used for a variety of purposes, such as:

    • Developing ASL recognition models
    • Improving the accuracy of ASL recognition models
    • Researching ASL
    • Improving the accessibility of ASL content

    Data Format: The files in the dataset are in .npy format.

    Data Split: The dataset is split into a training set, a validation set, and a test set. The training set contains 80% of the data, the validation set contains 10% of the data, and the test set contains 10% of the data.

  11. CYP450 80/20 splits

    • figshare.com
    txt
    Updated Jan 19, 2016
    Cite
    Daniel Siegle (2016). CYP450 80/20 splits [Dataset]. http://doi.org/10.6084/m9.figshare.1066108.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Daniel Siegle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data from an NIH HTS of 17K compounds against five isozymes of cytochrome P450, screening for inhibition. The activity score is taken from the NIH assay and merged with all the 2-D descriptors from the program Molecular Operating Environment (MOE). The datasets are separated by isozyme and then balanced between actives and inactives. Finally, the balanced datasets are subjected to an 80/20 training/test split. Link to python script of data manipulation...
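
    That preparation can be sketched roughly as follows; the file and column names are illustrative assumptions, not taken from the record:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv('cyp450_isozyme.csv')  # placeholder file name

    # Balance actives and inactives by down-sampling the majority class
    # (assumes inactives outnumber actives, as is typical in HTS data).
    actives = df[df['active'] == 1]
    inactives = df[df['active'] == 0].sample(n=len(actives), random_state=0)
    balanced = pd.concat([actives, inactives])

    # 80/20 training/test split, stratified on the activity label.
    train, test = train_test_split(balanced, test_size=0.2,
                                   stratify=balanced['active'], random_state=0)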

  12. Coupled mass-spring-damper system for nonlinear system identification -...

    • darus.uni-stuttgart.de
    Updated Feb 26, 2025
    Cite
    Daniel Frank (2025). Coupled mass-spring-damper system for nonlinear system identification - actuated with random static inputs - synthetically generated [Dataset]. http://doi.org/10.18419/DARUS-4768
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 26, 2025
    Dataset provided by
    DaRUS
    Authors
    Daniel Frank
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    DFG
    Description

    Overview: This dataset contains input-output data of a coupled mass-spring-damper system with a nonlinear force profile. The data was generated with statesim [1], a Python package for simulating linear and nonlinear ODEs, for the system coupled-msd. The configuration .json files for the corresponding datasets (in-distribution and out-of-distribution) can be found in the respective folders. After creating the dataset, the files are stored in the raw folder. They are then split into subsets for training, testing, and validation, which can be found in the processed folder; details about the splitting are found in the config.json file. The dataset can be used to test system identification algorithms and methods that aim to identify nonlinear dynamics from input-output measurements. The training dataset is used to optimize the model parameters, the validation set for hyperparameter optimization, and the test set only for the final evaluation. In [2], the authors use the same underlying dynamics to create their dataset.

    Input generation: Input trajectories are piecewise constant trajectories.

    Noise: Gaussian white noise of approximately 30 dB is added at the output.

    Statistics: The input and output size is one.

    In-distribution data: 1,500,000 data points
    Training: 120 trajectories of length 7500
    Validation: 20 trajectories of length 7500
    Test: 60 trajectories of length 7500

    Out-of-distribution data: 10 × 300,000 data points
    10 different datasets were used only for testing. Each dataset contains 50 trajectories of length 6000.

    References
    [1] Frank, D. statesim [Computer software]. https://github.com/Dany-L/statesim
    [2] Revay, M., Wang, R., & Manchester, I. R. (2020). A convex parameterization of robust recurrent neural networks. IEEE Control Systems Letters, 5(4), 1363-1368.
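
    For reference, output noise at roughly a 30 dB signal-to-noise ratio can be added as in the sketch below; this is a generic recipe, and the exact noise model used by statesim may differ:

    import numpy as np

    def add_output_noise(y, snr_db=30.0, rng=np.random.default_rng(0)):
        # Scale Gaussian white noise so the signal-to-noise ratio is snr_db.
        signal_power = np.mean(y ** 2)
        noise_power = signal_power / (10 ** (snr_db / 10))
        return y + rng.normal(0.0, np.sqrt(noise_power), size=y.shape)

    y_noisy = add_output_noise(np.sin(np.linspace(0, 10, 7500)))  # illustrative trajectory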

  13. Damped pendulum for nonlinear system identification - inputs are sampled...

    • darus.uni-stuttgart.de
    Updated Feb 26, 2025
    Cite
    Daniel Frank (2025). Damped pendulum for nonlinear system identification - inputs are sampled from a multivariate-normal distribution - synthetically generated [Dataset]. http://doi.org/10.18419/DARUS-4770
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 26, 2025
    Dataset provided by
    DaRUS
    Authors
    Daniel Frank
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    DFG
    Description

    Overview: This dataset contains input-output data of a damped nonlinear pendulum that is actuated at the mounting point. The data was generated with statesim [1], a Python package for simulating linear and nonlinear ODEs, for the system actuated pendulum. The configuration .json files for the corresponding datasets (in-distribution and out-of-distribution) can be found in the respective folders. After creating the dataset, the files are stored in the raw folder. They are then split into subsets for training, testing, and validation, which can be found in the processed folder; details about the splitting are found in the config.json file. The dataset can be used to test system identification algorithms and methods that aim to identify nonlinear dynamics from input-output measurements. The training dataset is used to optimize the model parameters, the validation set for hyperparameter optimization, and the test set only for the final evaluation. In [2], the authors used the same underlying dynamics to create their dataset, but without damping terms.

    Input generation: Input trajectories are sampled from a multivariate normal distribution.

    Noise: Gaussian white noise of approximately 30 dB is added at the output.

    Statistics: The input and output size is one.

    In-distribution data: 2,100,000 data points
    Training: 10,000 trajectories of length 150
    Validation: 2,000 trajectories of length 150
    Test: 2,000 trajectories of length 150

    Out-of-distribution data: 7 × 100,000 data points
    7 different datasets were used only for testing. Each dataset contains 200 trajectories of length 500.

    References
    [1] Frank, D. statesim [Computer software]. https://github.com/Dany-L/statesim
    [2] Lu, L., Jin, P., Pang, G., Zhang, Z., & Karniadakis, G. E. (2021). Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3), 218-229.

  14. Train and Evaluation Code, Road Classification Models and Test set of the...

    • zenodo.org
    zip
    Updated Nov 4, 2024
    Cite
    Miguel Angel Manso Callejo; Miguel Angel Manso Callejo; Calimanut Ionut Cira; Calimanut Ionut Cira; Teresa Iturrioz; Teresa Iturrioz (2024). Train and Evaluation Code, Road Classification Models and Test set of the paper "Insights into the Effects of Image Overlap and Image Size on Semantic Segmentation Models Trained for Road Surface Area Extraction from Aerial Orthophotography" [Dataset]. http://doi.org/10.5281/zenodo.11494833
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Miguel Angel Manso Callejo; Miguel Angel Manso Callejo; Calimanut Ionut Cira; Calimanut Ionut Cira; Teresa Iturrioz; Teresa Iturrioz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the Python scripts built for training and evaluation of the implementation, together with the test data and the resulting road segmentation models corresponding to the paper "Insights into the Effects of Image Overlap and Image Size on Semantic Segmentation Models Trained for Road Surface Area Extraction from Aerial Orthophotography". The scripts make use of the Tensorflow with Keras framework and their additional required dependencies.

    The training and validation set is based on the binary SROADEX dataset (https://zenodo.org/records/6482346) that was re-split into tiles that feature the image resolutions (256 x 256, 512 x 512, and 1024 x 1024 pixels) and image overlaps (0% and 12.5%) considered in this study. The data have been generated using scripts developed in Python using Open Source libraries (GDAL/OGR and MapScript) for rasterization of vector cartography that represents the axes of the different types of roads (urban, interurban and rural). This binary road data contains information from 16 full orthoimages (28.5 km * 18.5 km) with spatial resolution of 0.5 m/pixel from the insular and peninsular Spanish territory. Due to the size on disk of approximately 492 gigabytes, this training and validation data is only available upon request from the corresponding author. The test set has been generated from a novel area from Palencia (Spain) and features 18 million pixels labelled with the positive "Road" class. The test sets are provided in the repository for each resolution (with no overlap), so that additional DL models can be evaluated on the same data and compared with the results achieved in this study.

    The structure of the information shared in this repository is as follows:
    The scripts have been grouped by tile resolution (256, 512 and 1024). At the top level, the test set and the evaluation script can be found. For each tile resolution, there are two subfolders (corresponding to "no overlap" and "12.5% overlap"). In each case, the Python scripts for training the models in the three repetitions are shared, and the trained models (H5 format) are shared in compressed form. Finally, for each resolution we also share the testing dataset, which consists of two folders.
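
    Once decompressed, the H5 models can presumably be loaded with the Keras framework mentioned above; the file name below is a placeholder:

    from tensorflow import keras

    # Placeholder path; substitute one of the H5 files from this repository.
    model = keras.models.load_model('road_segmentation_256_no_overlap.h5')
    model.summary()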

    The material is distributed under a CC-BY 4.0 license.

  15. CIFAR-100

    • kaggle.com
    zip
    Updated Oct 29, 2023
    Cite
    Ahmad (2023). CIFAR-100 [Dataset]. https://www.kaggle.com/datasets/pypiahmad/cifar-100
    Explore at:
    Available download formats: zip (168517947 bytes)
    Dataset updated
    Oct 29, 2023
    Authors
    Ahmad
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The CIFAR-100 dataset is a widely used dataset for training and evaluating machine learning models, particularly in the realm of image classification and computer vision. Here are some key aspects of the CIFAR-100 dataset:

    1. Image Content:

      • The dataset consists of 60,000 color images.
      • Each image is 32x32 pixels in size.
      • The images are categorized into 100 classes, which are further grouped into 20 superclasses.
    2. Class Structure:

      • Each class contains 600 images.
      • The 100 classes are grouped into 20 superclasses, where each superclass encompasses 5 classes.
      • Examples of superclasses include "aquatic mammals," "vehicles 1," and "fruit and vegetables," among others.
      • Each image has a "fine" label indicating its class and a "coarse" label indicating its superclass.
    3. Data Split:

      • The dataset is divided into training and testing sets.
      • There are 50,000 training images and 10,000 testing images.
      • Each class has 500 training images and 100 testing images.
    4. File Format:

      • The dataset is available in binary version, Python version, and Matlab version.
      • In the Python version, the dataset files are in a "pickled" format which can be loaded using Python's pickle module (a loading sketch is given at the end of this description).
      • In the binary version, each record consists of the label bytes followed by the pixel values of the image.
    5. Dataset Usage:

      • CIFAR-100 is commonly used for training and evaluating models for image classification, object recognition, and other computer vision tasks.
      • It's also used for benchmarking different machine learning algorithms and comparing their performance on a standard dataset.
    6. Dataset Origin:

      • The CIFAR-100 dataset was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.
      • It's a subset of the 80 million tiny images dataset and is curated to provide a well-structured, labeled dataset for machine learning research.
    7. Downloading and Citing:

      • If you intend to use the CIFAR-100 dataset for your work, it's customary to cite the dataset's creators and the associated technical report.

    The CIFAR-100 dataset provides a robust, well-organized set of images for machine learning and computer vision applications, making it a valuable resource for researchers and practitioners in the field.
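
    As noted under "File Format" above, the Python version is pickled; the loading routine documented for the CIFAR datasets looks like this (under Python 3, encoding='bytes' is required):

    import pickle

    def unpickle(file):
        # Returns a dict keyed by byte strings (e.g. b'data', b'fine_labels').
        with open(file, 'rb') as fo:
            return pickle.load(fo, encoding='bytes')

    train = unpickle('cifar-100-python/train')  # path inside the extracted archive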

  16. wider_face

    • tensorflow.org
    • opendatalab.com
    • +4 more
    Updated Dec 6, 2022
    Cite
    (2022). wider_face [Dataset]. https://www.tensorflow.org/datasets/catalog/wider_face
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    WIDER FACE dataset is a face detection benchmark dataset, whose images are selected from the publicly available WIDER dataset. We choose 32,203 images and label 393,703 faces with a high degree of variability in scale, pose and occlusion, as depicted in the sample images. WIDER FACE dataset is organized based on 61 event classes. For each event class, we randomly select 40%/10%/50% of the data as training, validation and testing sets. We adopt the same evaluation metric employed in the PASCAL VOC dataset. Similar to the MALF and Caltech datasets, we do not release bounding box ground truth for the test images. Users are required to submit final prediction files, which we shall proceed to evaluate.

    To use this dataset:

    import tensorflow_datasets as tfds

    ds = tfds.load('wider_face', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/wider_face-0.1.0.png

  17. Python code data of attention-based dual-scale hierarchical LSTM for tool...

    • scidb.cn
    Updated Nov 7, 2022
    Cite
    Hao Guo; Kunpeng Zhu (2022). Python code data of attention-based dual-scale hierarchical LSTM for tool wear monitoring [Dataset]. http://doi.org/10.57760/sciencedb.06004
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 7, 2022
    Dataset provided by
    Science Data Bank
    Authors
    Hao Guo; Kunpeng Zhu
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The experiment is based on a common high-speed milling data set to verify the robustness of the model to various tool types. The data set contains six sub-data sets, corresponding to the wear processes of six different types of tools. Three of the sub-data sets contain tool wear labels, while the other three do not. The tools used are all three-edged 6 mm ball cemented carbide tools, but their geometry and coating differ. The workpiece is Inconel 718, which is widely used for jet engine blade milling. The spindle speed is 10,360 rpm, and the cutting depth is 0.25 mm. The tool cuts from the upper edge of the workpiece surface to the lower edge in a zigzag manner. Over the whole milling process, the cutting length of each tool is about 0.1125 m × 315 passes = 35.44 m. The cutting signals in Experiment 1 include the cutting force signal collected by a three-channel Kistler dynamometer and the vibration signal collected by a three-channel Kistler accelerometer, both at a sampling rate of 50 kHz. A LEICA MZ12 microscope was used to measure the flank wear of the three teeth offline after each tool pass. In this experiment, a cutting signal is collected periodically to predict the wear of the three teeth of the tool.

    The samples are divided into a training set, an evaluation set, a test set, and a reconstruction set. The training and evaluation sets come from two kinds of tools and contain 30,000 and 4,096 samples, respectively; the test set comes from another tool and contains 9,472 samples; the reconstruction set comes from the unlabeled data generated by the other three tools and contains 40,832 samples. Each sample contains three channels of cutting force signal and three channels of vibration signal, with 2,304 sampling points per channel. The following preprocessing steps are performed:

    1) Signal clipping. Since the feed rate and sampling rate are constant throughout the experiment, the data set of each experiment can be approximately understood as a signal matrix evenly distributed over the workpiece surface, ignoring the slight difference in the number of sampling points per tool pass. The ordinate of the matrix corresponds to the index of the tool pass, and the abscissa corresponds to the index of the sampling point. Because the generation rules of cutting signals differ between the uncut, cut-in, cut-out, and stable states, the sampling points close to the edges of the workpiece are removed; 2% is simply cut off each end of the cutting signal obtained by each tool pass.

    2) Data amplification. Because tool wear can only be observed with a microscope after each tool pass, each wear label corresponds to a cutting signal containing about 120,000 sampling points, and measuring tool wear also takes a lot of time. In this case, the number of labels is not enough to fit the model, nor can the robustness of the algorithm be guaranteed, so it is necessary to split the samples and expand the tool wear labels. Considering that tool wear is a slow and continuous process, and that there is a certain deviation in the experimental measurement, linear interpolation is adopted. Quadratic interpolation and polynomial fitting were also tested, but no better results were observed. It needs to be stated that the essence of prediction is to find a function that maps the sample space to the target space: for any point in the sample space, the model can find the corresponding value in the target space. Sample amplification simply samples the target space more often, so as to describe this mapping relationship more comprehensively, rather than redefining it.

    The task of this study is to monitor the flank wear of the three teeth from the six-channel sensor signals. On the test set, the mean square error (MSE) and mean absolute percentage error (MAPE) between the predicted values and the microscope observations are 0.0013 and 4%, respectively, and the average and maximum final prediction errors (FPE) are 5 μm and 23 μm. The training time was 2,130 s, and a single prediction takes 1.79 ms. The accuracy, training time, and detection efficiency of the tool wear monitoring meet current industrial needs. As MPAN realizes the mapping from cutting signal to tool wear, the attention unit, acting as the gate that controls the information flow, retains the importance of the input features. The predicted tool wear curve is basically consistent with the curve observed by the microscope.
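
    The label amplification by linear interpolation described above amounts to something like the following sketch; the array contents are illustrative:

    import numpy as np

    # Wear is measured once per tool pass; labels for the many samples split
    # from the signal in between are obtained by linear interpolation.
    pass_index = np.array([0.0, 1.0, 2.0, 3.0])   # passes with measured wear
    wear = np.array([10.0, 12.5, 14.0, 16.0])     # flank wear per pass (illustrative)
    sample_pos = np.linspace(0.0, 3.0, 50)        # positions of the split samples
    sample_labels = np.interp(sample_pos, pass_index, wear)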

  18. A Dataset of Outdoor RSS Measurements for Localization

    • data.niaid.nih.gov
    Updated Jul 6, 2024
    Cite
    Frost Mitchell; Aniqua Baset; Sneha Kumar Kasera; Aditya Bhaskara (2024). A Dataset of Outdoor RSS Measurements for Localization [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7259894
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    University of Utah
    Authors
    Frost Mitchell; Aniqua Baset; Sneha Kumar Kasera; Aditya Bhaskara
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Update: New version includes additional samples taken in November 2022.

    Dataset Description

    This dataset is a large-scale set of measurements for RSS-based localization. The data consists of received signal strength (RSS) measurements taken using the POWDER Testbed at the University of Utah. Samples include either 0, 1, or 2 active transmitters.

    The dataset consists of 5,214 unique samples, with transmitters in 5,514 unique locations. The majority of the samples contain only 1 transmitter, but there are small sets of samples with 0 or 2 active transmitters, as shown below. Each sample has RSS values from between 10 and 25 receivers. The majority of the receivers are stationary endpoints fixed on the side of buildings, on rooftop towers, or on free-standing poles. A small set of receivers are located on shuttles which travel specific routes throughout campus.

    Dataset Description   Sample Count   Receiver Count
    No-Tx Samples         46             10 to 25
    1-Tx Samples          4822           10 to 25
    2-Tx Samples          346            11 to 12

    The transmitters for this dataset are handheld walkie-talkies (Baofeng BF-F8HP) transmitting in the FRS/GMRS band at 462.7 MHz. These devices have a rated transmission power of 1 W. The raw IQ samples were processed through a 6 kHz bandpass filter to remove neighboring transmissions, and the RSS value was calculated as follows:

    \( \mathrm{RSS} = \frac{10}{N} \log_{10}\left(\sum_{i=1}^{N} x_i^2 \right) \)
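    For concreteness, a direct transcription of this formula in Python. This is a sketch: the input is assumed to be the already band-pass-filtered IQ samples, and the sample values below are synthetic:

    import numpy as np

    def rss_db(x):
        """RSS as defined above: (10 / N) * log10(sum of squared sample magnitudes)."""
        x = np.abs(np.asarray(x))
        n = len(x)
        return (10.0 / n) * np.log10(np.sum(x ** 2))

    # Synthetic stand-in for one capture of N = 10,000 filtered samples
    rng = np.random.default_rng(0)
    iq = rng.normal(scale=1e-3, size=10_000) + 1j * rng.normal(scale=1e-3, size=10_000)
    print(rss_db(iq))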

    Measurement Parameter    Description
    Frequency                462.7 MHz
    Radio Gain               35 dB
    Receiver Sample Rate     2 MHz
    Sample Length            N = 10,000
    Band-pass Filter         6 kHz
    Transmitters             0 to 2
    Transmission Power       1 W

    Receivers consist of Ettus USRP X310 and B210 radios and a mix of wide- and narrow-band antennas, as shown in the table below. Each receiver took measurements with a receiver gain of 35 dB. However, the devices have different maximum gain settings, and no calibration data was available, so all RSS values in the dataset are uncalibrated and only relative to the device.

    Usage Instructions

    Data is provided in .json format, both as one file and as split files.

    import json

    data_file = 'powder_462.7_rss_data.json'
    with open(data_file) as f:
        data = json.load(f)

    The json data is a dictionary with the sample timestamp as a key. Within each sample are the following keys:

    rx_data: A list of data from each receiver. Each entry contains RSS value, latitude, longitude, and device name.

    tx_coords: A list of coordinates for each transmitter. Each entry contains latitude and longitude.

    metadata: A list of dictionaries containing metadata for each transmitter, in the same order as the rows in tx_coords.
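    A minimal sketch of walking the loaded dictionary, using only the keys documented above (the internal layout of each rx_data entry should be checked against the file itself):

    # Count receivers and transmitters per sample; `data` comes from json.load above.
    for timestamp, sample in data.items():
        n_rx = len(sample['rx_data'])      # one entry per reporting receiver
        n_tx = len(sample['tx_coords'])    # 0, 1, or 2 transmitter locations
        print(timestamp, n_rx, n_tx)
        break                              # remove to iterate over all samples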

    File Separations and Train/Test Splits

    In the separated_data.zip folder there are several train/test separations of the data.

    all_data contains all the data in the main JSON file, separated by the number of transmitters.

    stationary consists of 3 cases where a stationary receiver remained in one location for several minutes. This may be useful for evaluating localization using mobile shuttles, or measuring the variation in the channel characteristics for stationary receivers.

    train_test_splits contains unique data splits used for training and evaluating ML models. These splits only use data from the single-tx case; in other words, the union of all splits, together with unused.json, is equivalent to the file all_data/single_tx.json.

    The random split is a random 80/20 split of the data.

    special_test_cases contains the stationary transmitter data, indoor transmitter data (with high noise in GPS location), and transmitters off campus.

    The grid split divides the campus region into a 10 by 10 grid. Each grid square is assigned to the training or test set, with 80 squares in the training set and the remainder in the test set. If a square is assigned to the test set, none of its four neighbors is included in the test set. Transmitters occurring in each grid square are assigned to train or test accordingly. One such random assignment of grid squares makes up the grid split; a sketch of this assignment appears after this list.

    The seasonal split contains data separated by the month of collection: April, July, or November.

    The transportation split contains data separated by the method of movement for the transmitter: walking, cycling, or driving. The non-driving.json file contains the union of the walking and cycling data.

    campus.json contains the on-campus data, and is thus equivalent to the union of all splits, excluding unused.json.
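    The grid-square assignment can be reconstructed as follows. This is an illustrative sketch of the neighbor constraint described above (the greedy strategy and seed are assumptions, not the authors' exact procedure):

    import random

    def grid_split(n=10, n_test=20, seed=0):
        """Pick test squares in an n x n grid so that no two test squares are 4-neighbors."""
        rng = random.Random(seed)
        cells = [(r, c) for r in range(n) for c in range(n)]
        rng.shuffle(cells)
        test = set()
        for r, c in cells:
            if len(test) == n_test:
                break
            # accept a square only if none of its four neighbors is already a test square
            if not {(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)} & test:
                test.add((r, c))
        return set(cells) - test, test

    train_squares, test_squares = grid_split()
    print(len(train_squares), len(test_squares))  # e.g. 80 / 20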

    Digital Surface Model

    The dataset includes a digital surface model (DSM) from a State of Utah 2013-2014 LiDAR survey. This map includes the University of Utah campus and surrounding area. The DSM includes buildings and trees, unlike some digital elevation models.

    To read the data in python:

    import rasterio as rio
    import numpy as np
    import utm

    dsm_object = rio.open('dsm.tif')
    dsm_map = dsm_object.read(1)           # a np.array containing elevation values
    dsm_resolution = dsm_object.res        # a tuple containing the x, y resolution (0.5 meters)
    dsm_transform = dsm_object.transform   # an Affine transform for conversion to UTM-12 coordinates
    utm_transform = np.array(dsm_transform).reshape((3, 3))[:2]
    utm_top_left = utm_transform @ np.array([0, 0, 1])
    # The Affine transform takes (column, row); dsm_object.shape is (rows, columns)
    utm_bottom_right = utm_transform @ np.array([dsm_object.shape[1], dsm_object.shape[0], 1])
    latlon_top_left = utm.to_latlon(utm_top_left[0], utm_top_left[1], 12, 'T')
    latlon_bottom_right = utm.to_latlon(utm_bottom_right[0], utm_bottom_right[1], 12, 'T')
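    As a usage sketch, the elevation at a given latitude/longitude can be looked up by converting to UTM zone 12 and indexing the raster; the query point below is a hypothetical campus location:

    lat, lon = 40.766, -111.846                      # hypothetical query point
    easting, northing, _, _ = utm.from_latlon(lat, lon)
    row, col = dsm_object.index(easting, northing)   # map coordinates -> raster row/col
    print('elevation [m]:', dsm_map[row, col])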

    Dataset Acknowledgement: This DSM file was acquired by the State of Utah and its partners, is in the public domain, and can be freely distributed with proper credit to the State of Utah and its partners. The State of Utah and its partners make no warranty, expressed or implied, regarding its suitability for a particular use, and shall not be liable under any circumstances for any direct, indirect, special, incidental, or consequential damages with respect to users of this product.

    DSM DOI: https://doi.org/10.5069/G9TH8JNQ

  19. Skin Cancer Classification Images

    • kaggle.com
    zip
    Updated Dec 1, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rik (2024). Skin Cancer Classification Images [Dataset]. https://www.kaggle.com/datasets/rimkomatic/skin-cancer/discussion
    Explore at:
    zip(5195283401 bytes)Available download formats
    Dataset updated
    Dec 1, 2024
    Authors
    Rik
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Skin Cancer Classification Dataset

    Overview

    The Skin Cancer Classification Dataset is designed to support the development and evaluation of machine learning models for classifying skin cancer images into 8 distinct classes. This dataset provides a robust foundation for training, validating, and testing image classification models, particularly for deep learning frameworks.

    Features

    • Total Classes: 8 types of skin cancer.
    • Image Data: Preprocessed and standardized for efficient training.
    • Data Splits: The dataset is divided into:
      • Training Set
      • Validation Set
      • Test Set
    • File Format:
      • Features and labels are stored as pickle files:
      • train_x.pkl, train_y.pkl
      • val_x.pkl, val_y.pkl
      • test_x.pkl, test_y.pkl

    Dataset Structure

    Split        Features File   Labels File   Description
    Training     train_x.pkl     train_y.pkl   Input features and labels for training
    Validation   val_x.pkl       val_y.pkl     Data used for model evaluation during training
    Testing      test_x.pkl      test_y.pkl    Data for final performance testing

    Input Details

    • Image Shape: (224, 224, 3) (Height, Width, Channels)
    • Label Encoding: One-hot or integer-encoded labels for 8 classes.

    Applications

    This dataset is ideal for:

    • Building deep learning models for multi-class image classification.
    • Experimenting with transfer learning and ensemble methods.
    • Developing tools for skin cancer detection in clinical applications.

    Instructions

    1. Loading Data

    The dataset is saved as pickle files for efficient storage and loading. Use the following Python code to load the data:

    import pickle
    
    # Example: Loading training data
    with open('train_x.pkl', 'rb') as f:
      train_x = pickle.load(f)
    
    with open('train_y.pkl', 'rb') as f:
      train_y = pickle.load(f)
    
    print("Training data loaded successfully!")
    

    2. Training a Model

    The dataset is compatible with popular deep learning frameworks like TensorFlow and PyTorch. Be sure to preprocess the data according to your model's requirements, as in the sketch below.
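    A minimal PyTorch-flavored sketch of that preprocessing. Assumptions: images are stored channels-last and may need scaling to [0, 1] (skip the division if the pickles are already normalized), and labels may be either one-hot or integer-encoded:

    import pickle
    import numpy as np
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    with open('train_x.pkl', 'rb') as f:
        train_x = pickle.load(f)
    with open('train_y.pkl', 'rb') as f:
        train_y = pickle.load(f)

    # (N, 224, 224, 3) channels-last -> (N, 3, 224, 224) channels-first, scaled to [0, 1]
    x = torch.from_numpy(np.asarray(train_x, dtype=np.float32) / 255.0).permute(0, 3, 1, 2)
    y = torch.as_tensor(np.asarray(train_y))
    if y.ndim == 2:          # one-hot labels -> integer class indices
        y = y.argmax(dim=1)

    loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)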

    Acknowledgements

    This dataset was prepared with the goal of aiding researchers and developers in advancing skin cancer detection technologies. Special thanks to all contributors and sources for the dataset's creation.

  20. LAIL

    • figshare.com
    zip
    Updated Jul 30, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jia Li (2024). LAIL [Dataset]. http://doi.org/10.6084/m9.figshare.22014596.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Jia Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LAIL is a Large-language-model-Aware selection approach for In-Context-Learning-based code generation. LAIL uses LLMs themselves to select examples: the LLM labels a candidate example as a positive or negative example for a given requirement.

    Requirements

    • openai
    • tqdm
    • java

    A script (/Evaluation/evaluation_setup.sh) is also provided to help set up the programming-language dependencies used in evaluation:

    bash evaluation_setup.sh

    Dataset

    The datasets comprise DevEval, MBJP, MBPP, MBCPP, and HumanEval. DevEval is a repository-level code generation dataset collected from real-world code repositories and aligned with them in multiple dimensions, so we take DevEval as the example of how to process the data. Take ../Dataset/DevEval as an example:

    train.jsonl and test.jsonl: (1) We randomly select two domains to evaluate LAIL and the baselines: scientific engineering and text processing. (2) We randomly split the tasks of the two domains into a training set and a test set, giving 101 training examples and 49 test examples. (3) Given a requirement from a repository, we use tree-sitter to parse the repository and extract all of its functions. (4) The functions contained in the repository form the candidate pool, from which LAIL and the baselines retrieve a few functions as demonstration examples.

    The source data and test_source data folders contain the original code repositories collected from GitHub. The estimate_prompt folder contains the constructed prompts used to estimate candidate examples. The generation_prompt folder contains the constructed prompts whose demonstration examples were selected by LAIL and the different baselines. For example: (1) the ICL_LAIL folder provides the ids of the selected examples in LAIL_id; developers can use the provided prompts directly with codellama_completion.py to generate programs; (2) after generating programs, process them with process_generation.py; (3) finally, evaluate the generated programs with the source code in the Evaluation folder.

    Estimate candidate examples by LLMs themselves

    We leverage LLMs themselves to estimate candidate examples; the code is stored in the LAIL/estimate_examples package. Take DevEval as an example: (1) the /Dataset/DevEval/estimate_prompt folder contains the constructed estimation prompts; (2) run the following command to estimate candidate examples with CodeLlama-7B:

    bash make_estimation_prompt.sh ../Dataset/DevEval/estimation_prompt

    (3) According to the probability feedback of the LLM, we obtain the positive and negative examples.

    Train a neural retriever

    We use the labeled positive and negative examples to train a neural retriever with contrastive learning. The code is stored in the /LAIL/LAIL/retriever/train folder:

    export CUDA_VISIBLE_DEVICES=0
    nohup python run.py \
      --output_dir=/saved_models \
      --model_type=roberta \
      --config_name=microsoft/graphcodebert-base \
      --model_name_or_path=microsoft/graphcodebert-base \
      --tokenizer_name=microsoft/graphcodebert-base \
      --do_train \
      --train_data_file=/id.jsonl \
      --epoch 100 \
      --block_size 128 \
      --train_batch_size 16 \
      --learning_rate 1e-4 \
      --max_grad_norm 1.0 \
      --seed 123456 > mbpp.txt 2>&1 &

    Select a few demonstration examples using the trained retriever

    Given a test requirement, developers use the trained retriever to select a few demonstration examples:

    bash run_inference.sh ../Dataset/DevEval

    Code Generation

    After building the prompt context from the selected examples, developers feed a test requirement and the prompt context into the LLM to obtain the desired programs. For example, with CodeLlama (../LAIL/ICL_LAIL/codellama_completion.py):

    export CUDA_VISIBLE_DEVICES=0
    torchrun --nproc_per_node=1 --master_port=16665 codellama_completion.py Salesforce/CodeLlama-7b ../Dataset/DevEval/prompt_LAIL.jsonl --temperature=0.8 --max_batch_size=4 --output_base=output_random --get_logits=False

    After generating programs, process them with ../LAIL/ICL_LAIL/process_generation.py:

    python process_generation.py

    Baselines

    The paper includes seven baselines that use different approaches to select demonstration examples for ICL-based code generation. The source code is in the baselines folder, with one folder per baseline. The selected examples of all baselines can be obtained by running:

    python baselines.py

    Then use /baselines/make_prompt.py to construct a prompt context from the selected candidate examples:

    python make_prompt.py ICLCoder ICLCoder -1

    Evaluation

    We use Pass@k to evaluate the performance of LAIL and the baselines, with the source code in LAIL/Evaluation. Since DevEval is a repository-level code generation dataset and therefore complex to evaluate, developers can use the pipeline in /LAIL/Evaluation/ to evaluate the different approaches.

    Citation

    If you have any questions or suggestions, please email us at lijiaa@pku.edu.cn.
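    For orientation, a minimal sketch of how a trained bi-encoder retriever can rank candidate examples at inference time. The base checkpoint matches the run.py arguments above, but the [CLS]-pooling choice and the example strings are assumptions, not the repository's exact inference code (see run_inference.sh for that):

    import torch
    from transformers import AutoTokenizer, AutoModel

    tok = AutoTokenizer.from_pretrained('microsoft/graphcodebert-base')
    model = AutoModel.from_pretrained('microsoft/graphcodebert-base')  # load the fine-tuned weights in practice

    def embed(texts):
        batch = tok(texts, padding=True, truncation=True, max_length=128, return_tensors='pt')
        with torch.no_grad():
            out = model(**batch).last_hidden_state[:, 0]   # [CLS] token embedding
        return torch.nn.functional.normalize(out, dim=-1)

    requirement = "parse a CSV file and return rows as dicts"             # hypothetical query
    candidates = ["def read_csv(path): ...", "def write_json(obj): ..."]  # hypothetical pool
    scores = embed([requirement]) @ embed(candidates).T                   # cosine similarities
    top = scores.squeeze(0).topk(k=1).indices.tolist()                    # selected demonstration ids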
