53 datasets found
  1. Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • data.niaid.nih.gov
    Updated Aug 8, 2022
    Cite
    Köhler, Juliane (2022). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6957841
    Explore at:
    Dataset updated
    Aug 8, 2022
    Dataset authored and provided by
    Köhler, Juliane
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

    Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.

    ger_train.csv – The German training set as CSV file.

    ger_validation.csv – The German validation set as CSV file.

    en_test.csv – The English test set as CSV file.

    en_train.csv – The English training set as CSV file.

    en_validation.csv – The English validation set as CSV file.

    splitting.py – The python code for splitting a dataset into train, test and validation set.

    DataSetTrans_de.csv – The final German dataset as a CSV file.

    DataSetTrans_en.csv – The final English dataset as a CSV file.

    translation.py – The python code for translating the cleaned dataset.
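
    The split produced by splitting.py is not reproduced here; the following is a minimal sketch of a comparable train/validation/test split, assuming pandas and scikit-learn. The 80/10/10 ratios and the output file names are illustrative and not taken from the original script.

    ```python
    # Illustrative train/validation/test split in the spirit of splitting.py.
    # Assumes pandas and scikit-learn; ratios and file names are examples only.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("Cleaned_Dataset.csv")

    # Hold out 10% as a test set, then split the remainder into train and validation.
    train_val, test = train_test_split(df, test_size=0.10, random_state=42)
    train, val = train_test_split(train_val, test_size=0.10 / 0.90, random_state=42)

    train.to_csv("train.csv", index=False)
    val.to_csv("validation.csv", index=False)
    test.to_csv("test.csv", index=False)
    ```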

  2. Industrial Machine Tool Element Surface Defect Dataset - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Jun 2, 2024
    + more versions
    Cite
    (2024). Industrial Machine Tool Element Surface Defect Dataset - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/63d20a5e-3584-5096-a34d-d3f93fcc8857
    Explore at:
    Dataset updated
    Jun 2, 2024
    Description

    Using machine learning techniques in general, and deep learning techniques in particular, requires an amount of data that is often not available in large quantities in some technical domains. The manual inspection of machine tool components, as well as the manual end-of-line check of products, are labour-intensive tasks in industrial applications that companies often want to automate. To automate these classification processes and to develop reliable and robust machine-learning-based classification and wear-prognostics models, real-world datasets are needed to train and test models on.

    The dataset contains 1104 three-channel images with 394 image annotations for the surface damage type “pitting”. The annotations, made with the annotation tool labelme, are available in JSON format and hence convertible to VOC and COCO format. All images come from two BSD types. The dataset available for download is divided into the folders data (all images as JPEG), label (all annotations), and saved_model (a baseline model). The authors also provide a Python script to divide the data and labels into three different split types: train_test_split, which splits images into the same train and test data split the authors used for the baseline model; wear_dev_split, which creates all 27 wear developments; and type_split, which splits the data into the occurring BSD types.

    One of the two mentioned BSD types is represented with 69 images and 55 different image sizes. All images with this BSD type come either in a clean or a soiled condition. The other BSD type is shown on 325 images with two image sizes. Since all images of this type have been taken continuously over time, the degree of soiling evolves. As mentioned above, the dataset also contains 27 pitting development sequences of 69 images each.

    Instructions for the dataset split

    The authors of this dataset provide three different dataset splits. To get a data split, run the Python script split_dataset.py.

    Script inputs:
    • split-type (mandatory)
    • output directory (mandatory)

    Split types:
    • train_test_split: splits the dataset into train and test data (80%/20%)
    • wear_dev_split: splits the dataset into 27 wear developments
    • type_split: splits the dataset into the different BSD types

    Example: C:\Users\Desktop>python split_dataset.py --split_type=train_test_split --output_dir=BSD_split_folder
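
    For orientation, a skeleton of a command-line interface matching the documented options of split_dataset.py might look as follows. This is not the authors' script; it only mirrors the documented --split_type and --output_dir arguments to show how the three split types could be dispatched.

    ```python
    # Illustrative CLI skeleton mirroring the documented interface of split_dataset.py.
    # NOT the authors' implementation; the dispatch bodies are placeholders.
    import argparse


    def main() -> None:
        parser = argparse.ArgumentParser(description="Split the BSD image dataset.")
        parser.add_argument("--split_type", required=True,
                            choices=["train_test_split", "wear_dev_split", "type_split"])
        parser.add_argument("--output_dir", required=True)
        args = parser.parse_args()

        if args.split_type == "train_test_split":
            pass  # copy images into 80%/20% train/test folders under args.output_dir
        elif args.split_type == "wear_dev_split":
            pass  # write the 27 wear-development sequences
        else:
            pass  # group images by BSD type


    if __name__ == "__main__":
        main()
    ```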

  3. Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...

    • zenodo.org
    • data.europa.eu
    zip
    Updated Aug 24, 2021
    + more versions
    Cite
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://doi.org/10.5281/zenodo.4571228
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 24, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • The dataset is gathered on Sep. 17th 2020. It has more than 5.4K Python repositories that are hosted on GitHub. Check out the file ManyTypes4PyDataset.spec for repositories URL and their commit SHA.
    • The dataset is also de-duplicated using the CD4Py tool. The list of duplicate files is provided in duplicate_files.txt file.
    • All of its Python projects are processed in JSON-formatted files. They contain a seq2seq representation of each file, type-related hints, and information for machine learning models. The structure of JSON-formatted files is described in JSONOutput.md file.
    • The dataset is split into train, validation and test sets by source code files. The list of files and their corresponding set is provided in dataset_split.csv file.
    • Notable changes to each version of the dataset are documented in CHANGELOG.md.
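
    A quick way to inspect the split is to read dataset_split.csv with pandas; the column names used below are assumptions for illustration only (see the CSV header and JSONOutput.md for the actual schema).

    ```python
    # Inspect the train/validation/test assignment of source files.
    # Column name "set" is assumed for illustration; check the CSV header.
    import pandas as pd

    split = pd.read_csv("dataset_split.csv")
    print(split.head())

    if "set" in split.columns:
        # Count how many files were assigned to each set.
        print(split["set"].value_counts())
    ```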
  4. Dataset for Cost-effective Simulation-based Test Selection in Self-driving...

    • zenodo.org
    • data.niaid.nih.gov
    pdf, zip
    Updated Jul 17, 2024
    Cite
    Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella (2024). Dataset for Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor [Dataset]. http://doi.org/10.5281/zenodo.5914130
    Explore at:
    Available download formats: zip, pdf
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SDC-Scissor tool for Cost-effective Simulation-based Test Selection in Self-driving Cars Software

    This dataset provides test cases for self-driving cars with the BeamNG simulator. Check out the repository and demo video to get started.

    GitHub: github.com/ChristianBirchler/sdc-scissor

    This project extends the tool competition platform from the Cyber-Physical Systems Testing Competition, which was part of the SBST Workshop in 2021.

    Usage

    Demo

    YouTube Link

    Installation

    The tool can either be run with Docker or locally using Poetry.

    Running the simulations requires a working installation of BeamNG.research. Note that the simulation cannot be run in a Docker container; it must run locally.

    To install the application use one of the following approaches:

    • Docker: docker build --tag sdc-scissor .
    • Poetry: poetry install

    Using the Tool

    The tool can be used with the following two commands:

    • Docker: docker run --volume "$(pwd)/results:/out" --rm sdc-scissor [COMMAND] [OPTIONS] (this will write all files written to /out to the local folder results)
    • Poetry: poetry run python sdc-scissor.py [COMMAND] [OPTIONS]

    There are multiple commands available. To keep the documentation simple, only the commands and their options are described.

    • Generation of tests:
      • generate-tests --out-path /path/to/store/tests
    • Automated labeling of Tests:
      • label-tests --road-scenarios /path/to/tests --result-folder /path/to/store/labeled/tests
      • Note: This only works locally with BeamNG.research installed
    • Model evaluation:
      • evaluate-models --dataset /path/to/train/set --save
    • Split train and test data:
      • split-train-test-data --scenarios /path/to/scenarios --train-dir /path/for/train/data --test-dir /path/for/test/data --train-ratio 0.8
    • Test outcome prediction:
      • predict-tests --scenarios /path/to/scenarios --classifier /path/to/model.joblib
    • Evaluation based on random strategy:
      • evaluate --scenarios /path/to/test/scenarios --classifier /path/to/model.joblib

    The possible parameters are always documented with --help.

    Linting

    The tool is verified with the linters flake8 and pylint. These are automatically enabled in Visual Studio Code and can be run manually with the following commands:

    poetry run flake8 .
    poetry run pylint **/*.py

    License

    The software we developed is distributed under the GNU GPL license. See the LICENSE.md file.

    Contacts

    Christian Birchler - Zurich University of Applied Science (ZHAW), Switzerland - birc@zhaw.ch

    Nicolas Ganz - Zurich University of Applied Science (ZHAW), Switzerland - gann@zhaw.ch

    Sajad Khatiri - Zurich University of Applied Science (ZHAW), Switzerland - mazr@zhaw.ch

    Dr. Alessio Gambi - Passau University, Germany - alessio.gambi@uni-passau.de

    Dr. Sebastiano Panichella - Zurich University of Applied Science (ZHAW), Switzerland - panc@zhaw.ch

    References

    • Christian Birchler, Nicolas Ganz, Sajad Khatiri, Alessio Gambi, and Sebastiano Panichella. 2022. Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor. In 2022 IEEE 29th International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE.

    If you use this tool in your research, please cite the following papers:

    @INPROCEEDINGS{Birchler2022,
     author={Birchler, Christian and Ganz, Nicolas and Khatiri, Sajad and Gambi, Alessio and Panichella, Sebastiano},
     booktitle={2022 IEEE 29th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
     title={Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor},
     year={2022},
    }
  5. VegeNet - Image datasets and Codes

    • zenodo.org
    zip
    Updated Oct 27, 2022
    Cite
    Jo Yen Tan (2022). VegeNet - Image datasets and Codes [Dataset]. http://doi.org/10.5281/zenodo.7254508
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 27, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jo Yen Tan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Compilation of python codes for data preprocessing and VegeNet building, as well as image datasets (zip files).

    Image datasets:

    1. vege_original : Images of vegetables captured manually in data acquisition stage
    2. vege_cropped_renamed : Images in (1) cropped to remove background areas and image labels renamed
    3. non-vege images : Images of non-vegetable foods for CNN network to recognize other-than-vegetable foods
    4. food_image_dataset : Complete set of vege (2) and non-vege (3) images for architecture building.
    5. food_image_dataset_split : Image dataset (4) split into train and test sets
    6. process : Images created when cropping (pre-processing step) to create dataset (2).
  6. ref_coco

    • tensorflow.org
    • opendatalab.com
    Updated May 31, 2024
    Cite
    (2024). ref_coco [Dataset]. https://www.tensorflow.org/datasets/catalog/ref_coco
    Explore at:
    Dataset updated
    May 31, 2024
    Description

    A collection of 3 referring expression datasets based off images in the COCO dataset. A referring expression is a piece of text that describes a unique object in an image. These datasets are collected by asking human raters to disambiguate objects delineated by bounding boxes in the COCO dataset.

    RefCoco and RefCoco+ are from Kazemzadeh et al. 2014. RefCoco+ expressions are strictly appearance-based descriptions, which was enforced by preventing raters from using location-based descriptions (e.g., "person to the right" is not a valid description for RefCoco+). RefCocoG is from Mao et al. 2016 and has richer descriptions of objects than RefCoco due to differences in the annotation process. In particular, RefCoco was collected in an interactive game-based setting, while RefCocoG was collected in a non-interactive setting. On average, RefCocoG has 8.4 words per expression while RefCoco has 3.5 words.

    Each dataset has different split allocations that are typically all reported in papers. The "testA" and "testB" sets in RefCoco and RefCoco+ contain only people and only non-people respectively. Images are partitioned into the various splits. In the "google" split, objects, not images, are partitioned between the train and non-train splits. This means that the same image can appear in both the train and validation split, but the objects being referred to in the image will be different between the two sets. In contrast, the "unc" and "umd" splits partition images between the train, validation, and test split. In RefCocoG, the "google" split does not have a canonical test set, and the validation set is typically reported in papers as "val*".

    Stats for each dataset and split ("refs" is the number of referring expressions, and "images" is the number of images):

    dataset    partition  split  refs   images
    refcoco    google     train  40000  19213
    refcoco    google     val    5000   4559
    refcoco    google     test   5000   4527
    refcoco    unc        train  42404  16994
    refcoco    unc        val    3811   1500
    refcoco    unc        testA  1975   750
    refcoco    unc        testB  1810   750
    refcoco+   unc        train  42278  16992
    refcoco+   unc        val    3805   1500
    refcoco+   unc        testA  1975   750
    refcoco+   unc        testB  1798   750
    refcocog   google     train  44822  24698
    refcocog   google     val    5000   4650
    refcocog   umd        train  42226  21899
    refcocog   umd        val    2573   1300
    refcocog   umd        test   5023   2600

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('ref_coco', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/ref_coco-refcoco_unc-1.1.0.png

  7. Data from: Keyword extraction datasets for Croatian, Estonian, Latvian and...

    • live.european-language-grid.eu
    binary format
    Updated Jun 3, 2021
    + more versions
    Cite
    (2021). Keyword extraction datasets for Croatian, Estonian, Latvian and Russian 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8369
    Explore at:
    Available download formats: binary format
    Dataset updated
    Jun 3, 2021
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    EACL Hackashop Keyword Challenge Datasets

    In this repository you can find the IDs of articles used for the keyword extraction challenge at the EACL Hackashop on News Media Content Analysis and Automated Report Generation (http://embeddia.eu/hackashop2021/). The article IDs can be used to generate the train-test split used in the paper:

    Koloski, B., Pollak, S., Škrlj, B., & Martinc, M. (2021). Extending Neural Keyword Extraction with TF-IDF tagset matching. In: Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, Kiev, Ukraine, pages 22–29.

    Train and test splits are provided for Latvian, Estonian, Russian and Croatian.

    The articles with the corresponding ID-s can be extracted from the following datasets:

    - Estonian and Russian (use the eearticles2015-2019 dataset): https://www.clarin.si/repository/xmlui/handle/11356/1408

    - Latvian: https://www.clarin.si/repository/xmlui/handle/11356/1409

    - Croatian: https://www.clarin.si/repository/xmlui/handle/11356/1410

    dataset_ids folder is organized in the following way:

    - latvian – containing latvian_train.json: a json file with ids from train articles to replicate the data used in Koloski et al. (2020), the latvian_test.json: a json file with ids from test articles to replicate the data

    - estonian – containing estonian_train.json: a json file with ids from train articles to replicate the data used in Koloski et al. (2020), the estonian_test.json: a json file with ids from test articles to replicate the data

    - russian – containing russian_train.json: a json file with ids from train articles to replicate the train data used in Koloski et al. (2020), the russian_test.json: a json file with ids from test articles to replicate the data

    - croatian - containing croatian_id_train.tsv file with sites and ids (note that just ids are not unique across dataset, therefore site information also needs to be included to obtain a unique article identifier) of articles in the train set, and the croatian_id_test.tsv file with sites and ids of articles in the test set.

    In addition, scripts are provided for extracting articles (see folder parse containing scripts parse.py and build_croatian_dataset.py, requirements for scripts are pandas and bs4 Python libraries):

    parse.py is used for extraction of Estonian, Russian and Latvian train and test datasets:

    Instructions:

    ESTONIAN-RUSSIAN

    1) Retrieve the data ee_articles_2015_2019.zip

    2) Create a folder 'data' and subfolder 'ee'

    3) Unzip them in the 'data/ee' folder

    To extract train/test Estonian articles:

    run function 'build_dataset(lang="ee", opt="nat")' in the parse.py script

    To extract train/test Russian articles:

    run function 'build_dataset(lang="ee", opt="rus")' in the parse.py script

    LATVIAN:

    1) Retrieve the latvian data

    2) Unzip it in 'data/lv' folder

    3) To extract train/test Latvian articles:

    run function 'build_dataset(lang="lv", opt="nat")' in the parse.py script
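
    The extraction calls above can be issued from a short driver script; the following is a minimal sketch, assuming parse.py is importable from the working directory and the data folders are laid out as described.

    ```python
    # Minimal driver sketch for the extraction functions described above.
    # Assumes parse.py is importable from the current directory and that the
    # data/ee and data/lv folders have been prepared as described.
    from parse import build_dataset

    build_dataset(lang="ee", opt="nat")  # Estonian train/test articles
    build_dataset(lang="ee", opt="rus")  # Russian train/test articles
    build_dataset(lang="lv", opt="nat")  # Latvian train/test articles
    ```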

    build_croatian_dataset.py is used for extraction of Croatian train and test datasets:

    Instructions:

    CROATIAN:

    1) Retrieve the Croatian data (file 'STY_24sata_articles_hr_PUB-01.csv')

    2) put the script 'build_croatian_dataset.py' in the same folder as the extracted data and run it (e.g., python build_croatian_dataset.py).

    For additional questions: {Boshko.Koloski,Matej.Martinc,Senja.Pollak}@ijs.si

  8. Data from: JSON Dataset of Simulated Building Heat Control for System of...

    • researchdata.se
    • gimi9.com
    • +1more
    Updated Mar 21, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jacob Nilsson (2025). JSON Dataset of Simulated Building Heat Control for System of Systems Interoperability [Dataset]. http://doi.org/10.5878/e5hb-ne80
    Explore at:
    Available download formats: (438755370), (110041420), (156812), (5417)
    Dataset updated
    Mar 21, 2025
    Dataset provided by
    Luleå University of Technology
    Authors
    Jacob Nilsson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Luleå Municipality
    Description

    Interoperability in systems-of-systems is a difficult problem due to the abundance of data standards and formats. Current approaches to interoperability rely on hand-made adapters or methods using ontological metadata. This dataset was created to facilitate research on data-driven interoperability solutions. The data comes from a simulation of a building heating system, and the messages sent within control systems-of-systems. For more information see attached data documentation.

    The data comes in two semicolon-separated (;) csv files, training.csv and test.csv. The train/test split is not random; training data comes from the first 80% of simulated timesteps, and the test data is the last 20%. There is no specific validation dataset, the validation data should instead be randomly selected from the training data. The simulation runs for as many time steps as there are outside temperature values available. The original SMHI data only samples once every hour, which we linearly interpolate to get one temperature sample every ten seconds. The data saved at each time step consists of 34 JSON messages (four per room and two temperature readings from the outside), 9 temperature values (one per room and outside), 8 setpoint values, and 8 actuator outputs. The data associated with each of those 34 JSON-messages is stored as a single row in the tables. This means that much data is duplicated, a choice made to make it easier to use the data.

    The simulation data is not meant to be opened and analyzed in spreadsheet software, it is meant for training machine learning models. It is recommended to open the data with the pandas library for Python, available at https://pypi.org/project/pandas/.
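
    A minimal loading sketch with pandas, using the semicolon separator and the file names given above:

    ```python
    # Load the two semicolon-separated CSV files described above with pandas.
    import pandas as pd

    train = pd.read_csv("training.csv", sep=";")
    test = pd.read_csv("test.csv", sep=";")

    print(train.shape, test.shape)
    print(train.columns.tolist())
    ```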

    The data file with temperatures (smhi-july-23-29-2018.csv) acts as input for the thermodynamic building simulation found on Github, where it is used to get the outside temperature and corresponding timestamps. Temperature data for Luleå Summer 2018 were downloaded from SMHI.

  9. sft-python-q-problems

    • huggingface.co
    Updated Aug 31, 2025
    Cite
    Morgan Stanley (2025). sft-python-q-problems [Dataset]. https://huggingface.co/datasets/morganstanley/sft-python-q-problems
    Explore at:
    Dataset updated
    Aug 31, 2025
    Dataset authored and provided by
    Morgan Stanley
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    SFT Python-Q Programming Problems Dataset

    This dataset contains programming problems with solutions in both Python and Q programming languages, designed for supervised fine-tuning of code generation models.

      📊 Dataset Overview
    

    Total Problems: 678 unique programming problems
    Train Split: 542 problems
    Test Split: 136 problems
    Languages: Python and Q
    Source: LeetCode-style algorithmic problems
    Format: Multiple data formats for different use cases

      🎯 Key Features… See the full description on the dataset page: https://huggingface.co/datasets/morganstanley/sft-python-q-problems.
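
    A minimal loading sketch, assuming the standard Hugging Face `datasets` library and the repository id given above:

    ```python
    # Load the dataset from the Hugging Face Hub and inspect its splits.
    from datasets import load_dataset

    ds = load_dataset("morganstanley/sft-python-q-problems")
    print(ds)              # should list the train and test splits
    print(ds["train"][0])  # first training example
    ```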
    
  10. Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the...

    • researchdata.tuwien.at
    • b2find.eudat.eu
    html, pdf, zip
    Updated Mar 19, 2025
    Cite
    Negar Alinaghi; Ioannis Giannopoulos (2025). Decoding Wayfinding: Analyzing Wayfinding Processes in the Outdoor Environment [Dataset]. http://doi.org/10.48436/m2ha4-t1v92
    Explore at:
    Available download formats: html, zip, pdf
    Dataset updated
    Mar 19, 2025
    Dataset provided by
    TU Wien
    Authors
    Negar Alinaghi; Ioannis Giannopoulos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    How To Cite?

    Alinaghi, N., Giannopoulos, I., Kattenbeck, M., & Raubal, M. (2025). Decoding wayfinding: analyzing wayfinding processes in the outdoor environment. International Journal of Geographical Information Science, 1–31. https://doi.org/10.1080/13658816.2025.2473599

    Link to the paper: https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2473599

    Folder Structure

    The folder named “submission” contains the following:

    1. “pythonProject”: This folder contains all the Python files and subfolders needed for analysis.
    2. ijgis.yml: This file lists all the Python libraries and dependencies required to run the code.

    Setting Up the Environment

    1. Use the ijgis.yml file to create a Python project and environment. Ensure you activate the environment before running the code.
    2. The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below.

    Subfolders

    1. Data_4_IJGIS

    • This folder contains the data used for the results reported in the paper.
    • Note: The data analysis that we explain in this paper already begins with the synchronization and cleaning of the recorded raw data. The published data is already synchronized and cleaned. Both the cleaned files and the merged files with features extracted for them are given in this directory. If you want to perform the segmentation and feature extraction yourself, you should run the respective Python files yourself. If not, you can use the “merged_…csv” files as input for the training.

    2. results_[DateTime] (e.g., results_20240906_15_00_13)

    • This folder will be generated when you run the code and will store the output of each step.
    • The current folder contains results created during code debugging for the submission.
    • When you run the code, a new folder with fresh results will be generated.

    Python Files

    1. helper_functions.py

    • Contains reusable functions used throughout the analysis.
    • Each function includes a description of its purpose and the input parameters required.

    2. create_sanity_plots.py

    • Generates scatter plots like those in Figure 3 of the paper.
    • Although the code has been run for all 309 trials, it can be used to check the sample data provided.
    • Output: A .png file for each column of the raw gaze and IMU recordings, color-coded with logged events.
    • Usage: Run this file to create visualizations similar to Figure 3.

    3. overlapping_sliding_window_loop.py

    • Implements overlapping sliding window segmentation and generates plots like those in Figure 4.
    • Output:
      • Two new subfolders, “Gaze” and “IMU”, will be added to the Data_4_IJGIS folder.
      • Segmented files (default: 2–10 seconds with a 1-second step size) will be saved as .csv files.
      • A visualization of the segments, similar to Figure 4, will be automatically generated.

    4. gaze_features.py & imu_features.py (Note: there has been an update to the IDT function implementation in the gaze_features.py on 19.03.2025.)

    • These files compute features as explained in Tables 1 and 2 of the paper, respectively.
    • They process the segmented recordings generated by the overlapping_sliding_window_loop.py.
    • Usage: To see how the features are calculated, run these files after the sliding-window segmentation to compute the features from the segmented data.

    5. training_prediction.py

    • This file contains the main machine learning analysis of the paper: all the code for training the model, evaluating it, and using it for inference on the “monitoring part”. It covers the following steps:
    a. Data Preparation (corresponding to Section 5.1.1 of the paper)
    • Prepares the data according to the research question (RQ) described in the paper. Since this data was collected with several RQs in mind, we remove parts of the data that are not related to the RQ of this paper.
    • A function named plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out, but if you want to see visually what has been changed compared to the original data, you can comment out this line.
    b. Training/Validation/Test Split
    • Splits the data for machine learning experiments (an explanation can be found in Section 5.1.1. Preparation of data for training and inference of the paper).
    • Make sure that you follow the instructions in the comments to the code exactly.
    • Output: The split data is saved as .csv files in the results folder.
    c. Machine and Deep Learning Experiments

    This part contains three main code blocks:

    • MLP Network (Commented Out): This code was used for classification with the MLP network, and the results shown in Table 3 are from this code. If you wish to use this model, please comment out the following blocks accordingly.
    • XGBoost without Hyperparameter Tuning: If you want to run the code but do not want to spend time on the full training with hyperparameter tuning (as was done for the paper), just uncomment this part. This will give you a simple, untuned model with which you can achieve at least some results.
    • XGBoost with Hyperparameter Tuning: If you want to train the model the way we trained it for the analysis reported in the paper, use this block (the plots in Figure 7 are from this block). We ran this block with different feature sets and different segmentation files and created a simple bar chart from the saved results, shown in Figure 6.

    Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2. Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.

    d. Inference (Monitoring Part)
    • Final inference is performed using the monitoring data. This step produces a .csv file containing inferred labels.
    • Figure 8 in the paper is generated using this part of the code.

    6. sequence_analysis.py

    • Performs analysis on the inferred data, producing Figures 9 and 10 from the paper.
    • This file reads the inferred data from the previous step and performs sequence analysis as described in Sections 5.2.1 and 5.2.2.

    Licenses

    The data is licensed under CC-BY, the code is licensed under MIT.

  11. JSON dataset för simulerad byggnadsvärmekontroll för system-av-system...

    • b2find.eudat.eu
    Updated Apr 19, 2022
    Cite
    (2022). JSON dataset för simulerad byggnadsvärmekontroll för system-av-system interoperabilitet - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/442bc87f-092d-57d9-a2f0-ba1c7e049d36
    Explore at:
    Dataset updated
    Apr 19, 2022
    Description

    Interoperability in systems-of-systems is a difficult problem due to the abundance of data standards and formats. Current approaches to interoperability rely on hand-made adapters or methods using ontological metadata. This dataset was created to facilitate research on data-driven interoperability solutions. The data comes from a simulation of a building heating system, and the messages sent within control systems-of-systems. For more information see the attached data documentation.

    The data comes in two semicolon-separated (;) csv files, training.csv and test.csv. The train/test split is not random; training data comes from the first 80% of simulated timesteps, and the test data is the last 20%. There is no specific validation dataset; the validation data should instead be randomly selected from the training data. The simulation runs for as many time steps as there are outside temperature values available. The original SMHI data only samples once every hour, which we linearly interpolate to get one temperature sample every ten seconds. The data saved at each time step consists of 34 JSON messages (four per room and two temperature readings from the outside), 9 temperature values (one per room and outside), 8 setpoint values, and 8 actuator outputs. The data associated with each of those 34 JSON messages is stored as a single row in the tables. This means that much data is duplicated, a choice made to make it easier to use the data.

    The simulation data is not meant to be opened and analyzed in spreadsheet software; it is meant for training machine learning models. It is recommended to open the data with the pandas library for Python, available at https://pypi.org/project/pandas/.

    The dataset contains simulated service data for system-of-systems interoperability research. For more information, see the attached documentation and the English catalogue page. Building temperature simulation. Simulation.

  12. MC-LSTM papers, model runs

    • search.dataone.org
    • hydroshare.org
    Updated Dec 30, 2023
    Cite
    Jonathan Martin Frame (2023). MC-LSTM papers, model runs [Dataset]. http://doi.org/10.4211/hs.d750278db868447dbd252a8c5431affd
    Explore at:
    Dataset updated
    Dec 30, 2023
    Dataset provided by
    Hydroshare
    Authors
    Jonathan Martin Frame
    Time period covered
    Jan 1, 1989 - Jan 1, 2015
    Area covered
    Description

    Runs from two papers exploring the use of mass conserving LSTM. Model results used in the papers are 1) model_outputs_for_analysis_extreme_events_paper.tar.gz, and 2) model_outputs_for_analysis_mass_balance_paper.tar.gz.

    The models here are trained/calibrated on three different time periods. Standard Time Split (time split 1): test period(1989-1999) is the same period used by previous studies which allows us to confirm that the deep learning models (LSTM andMC-LSTM) trained for this project perform as expected relative to prior work. NWM Time Split (time split 2): The second test period (1995-2014) allows us to benchmark against the NWM-Rv2, which does not provide data prior to 1995. Return period split: The third test period (based on return periods) allows us to benchmark only on water years that contain streamflow events that are larger (per basin) than anything seen in the training data (<= 5-year return periods in training and > 5-year return periods in testing).

    Also included are an ensemble of model runs for LSTM, MC-LSTM for the "standard" training period and two forcing products. These files are provided in the format "

    IMPORTANT NOTE: This python environment should be used to extract and load the data: https://github.com/jmframe/mclstm_2021_extrapolate/blob/main/python_environment.yml, as the pickle files serialized the data with specific versions of python libraries. Specifically, the pickle serialization was done with xarray=0.16.1.

    Code to interpret these runs can be found here: https://github.com/jmframe/mclstm_2021_extrapolate https://github.com/jmframe/mclstm_2021_mass_balance

    Papers are available here: https://hess.copernicus.org/preprints/hess-2021-423/

  13. STEAD subsample 4 CDiffSD

    • zenodo.org
    bin
    Updated Apr 30, 2024
    + more versions
    Cite
    Daniele Trappolini (2024). STEAD subsample 4 CDiffSD [Dataset]. http://doi.org/10.5281/zenodo.11094536
    Explore at:
    Available download formats: bin
    Dataset updated
    Apr 30, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniele Trappolini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 15, 2024
    Description

    STEAD Subsample Dataset for CDiffSD Training

    Overview

    This dataset is a subsampled version of the STEAD dataset, specifically tailored for training our CDiffSD model (Cold Diffusion for Seismic Denoising). It consists of four HDF5 files, each saved in a format that requires Python's `h5py` library to open.

    Dataset Files

    The dataset includes the following files:

    • train: Used for both the training and validation phases (the validation set is split from this file). Contains earthquake ground-truth traces.
    • noise_train: Used for both training and validation phases. Contains noise used to contaminate the traces.
    • test: Used for the testing phase, structured similarly to train.
    • noise_test: Used for the testing phase, contains noise data for testing.

    Each file is structured to support the training and evaluation of seismic denoising models.

    Data

    The HDF5 files named noise contain two main datasets:

    • traces: This dataset includes N events, each of size 6000, corresponding to the length of the traces. Each trace is organized into three channels in the following order: E (East-West), N (North-South), Z (Vertical).
    • metadata: This dataset contains the names of the traces for each event.

    Similarly, the train and test files, which contain earthquake data, include the same traces and metadata datasets, but also feature two additional datasets:

    • p_arrival: Contains the arrival indices of P-waves, expressed in counts.
    • s_arrival: Contains the arrival indices of S-waves, also expressed in counts.


    Usage

    To load these files in a Python environment, use the following approach:

    ```python
    import h5py
    import numpy as np

    # Open the HDF5 file in read mode
    with h5py.File('train_noise.hdf5', 'r') as file:
        # Print all the main keys in the file
        print("Keys in the HDF5 file:", list(file.keys()))

        if 'traces' in file:
            # Access the dataset
            data = file['traces'][:10]  # Load the first 10 traces

        if 'metadata' in file:
            # Access the dataset
            trace_name = file['metadata'][:10]  # Load the first 10 metadata entries
    ```

    Ensure that the path to the file is correctly specified relative to your Python script.

    Requirements

    To use this dataset, ensure you have Python installed along with the NumPy and h5py libraries, which can be installed via pip if not already available:

    ```bash
    pip install numpy
    pip install h5py
    ```

  14. imagenet2012

    • tensorflow.org
    Updated Jun 1, 2024
    + more versions
    Cite
    (2024). imagenet2012 [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenet2012
    Explore at:
    Dataset updated
    Jun 1, 2024
    Description

    ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, the majority of them nouns (80,000+). In ImageNet, we aim to provide on average 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. In its completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.

    The test split contains 100K images but no labels because no labels have been publicly released. We provide support for the test split from 2012 with the minor patch released on October 10, 2019. In order to manually download this data, a user must perform the following operations:

    1. Download the 2012 test split available here.
    2. Download the October 10, 2019 patch. There is a Google Drive link to the patch provided on the same page.
    3. Combine the two tar-balls, manually overwriting any images in the original archive with images from the patch. According to the instructions on image-net.org, this procedure overwrites just a few images.

    The resulting tar-ball may then be processed by TFDS.

    To assess the accuracy of a model on the ImageNet test split, one must run inference on all images in the split and export those results to a text file that must be uploaded to the ImageNet evaluation server. The maintainers of the ImageNet evaluation server permit a single user to make up to 2 submissions per week in order to prevent overfitting.

    To evaluate the accuracy on the test split, one must first create an account at image-net.org. This account must be approved by the site administrator. After the account is created, one can submit the results to the test server at https://image-net.org/challenges/LSVRC/eval_server.php The submission consists of several ASCII text files corresponding to multiple tasks. The task of interest is "Classification submission (top-5 cls error)". A sample of an exported text file looks like the following:

    771 778 794 387 650
    363 691 764 923 427
    737 369 430 531 124
    755 930 755 59 168
    

    The export format is described in full in "readme.txt" within the 2013 development kit available here: https://image-net.org/data/ILSVRC/2013/ILSVRC2013_devkit.tgz. Please see the section entitled "3.3 CLS-LOC submission format". Briefly, the text file consists of 100,000 lines, one for each image in the test split. Each line of integers corresponds to the rank-ordered, top 5 predictions for that test image. The integers are 1-indexed, corresponding to the line number in the corresponding labels file. See labels.txt.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('imagenet2012', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012-5.1.0.png

  15. Damped pendulum for nonlinear system identification - inputs are sampled...

    • b2find.eudat.eu
    Updated Jul 21, 2025
    + more versions
    Cite
    (2025). Damped pendulum for nonlinear system identification - inputs are sampled from a multivariate-normal distribution - synthetically generated - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/a411d807-9ffa-54f8-9f12-4646791cbda4
    Explore at:
    Dataset updated
    Jul 21, 2025
    Description

    Overview

    This dataset contains input-output data of a damped nonlinear pendulum that is actuated at the mounting point. The data was generated with statesim [1], a Python package for simulating linear and nonlinear ODEs, for the system "actuated pendulum". The configuration .json files for the corresponding datasets (in-distribution and out-of-distribution) can be found in the respective folders. After creating the dataset, the files are stored in the raw folder. They are then split into subsets for training, testing, and validation, which can be found in the processed folder; details about the splitting are found in the config.json file. The dataset can be used to test system identification algorithms and methods that aim to identify nonlinear dynamics from input-output measurements. The training dataset is used to optimize the model parameters, the validation set for hyperparameter optimization, and the test set only for the final evaluation. In [2], the authors used the same underlying dynamics to create their dataset, but without damping terms.

    Input generation: Input trajectories are sampled from a multivariate-normal distribution.

    Noise: Gaussian white noise of approximately 30 dB is added at the output.

    Statistics: The input and output size is one.
    • In-distribution data: 2,100,000 data points (training: 10,000 trajectories of length 150; validation: 2,000 trajectories of length 150; test: 2,000 trajectories of length 150).
    • Out-of-distribution data: 7 × 100,000 data points, in 7 different datasets used only for testing; each dataset contains 200 trajectories of length 500.

    References

    [1] Frank, D. statesim [Computer software]. https://github.com/Dany-L/statesim
    [2] Lu, L., Jin, P., Pang, G., Zhang, Z., & Karniadakis, G. E. (2021). Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3), 218-229.

  16. CYP450 80/20 splits

    • figshare.com
    txt
    Updated Jan 19, 2016
    Cite
    Daniel Siegle (2016). CYP450 80/20 splits [Dataset]. http://doi.org/10.6084/m9.figshare.1066108.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Daniel Siegle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data from an NIH HTS of 17K compounds, screened for inhibition against five isozymes of cytochrome P450. The activity score is taken from the NIH assay and merged with all the 2-D descriptors from the program Molecular Operating Environment (MOE). The datasets are separated by isozyme and then balanced between actives and inactives. Finally, the balanced datasets are subject to an 80/20 training/test split. Link to python script of data manipulation...
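
    For orientation, a minimal sketch of the kind of 80/20 training/test split described above, assuming scikit-learn. The file name and the "active" label column are illustrative; the balancing of actives vs. inactives is assumed to have been done beforehand, as in the description.

    ```python
    # Illustrative 80/20 split of a balanced isozyme dataset.
    # File name and label column are hypothetical placeholders.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    balanced = pd.read_csv("cyp450_isozyme_balanced.csv")  # hypothetical file name

    train, test = train_test_split(
        balanced,
        test_size=0.20,
        stratify=balanced["active"],  # hypothetical label column
        random_state=0,
    )
    print(len(train), len(test))
    ```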

  17. Monkeypox Skin Lesion Dataset

    • kaggle.com
    Updated Jul 5, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    TensorKitty (2022). Monkeypox Skin Lesion Dataset [Dataset]. https://www.kaggle.com/datasets/nafin59/monkeypox-skin-lesion-dataset
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 5, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    TensorKitty
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    An updated version of the MSLD dataset, MSLD v2.0 has been released after being verified by an expert dermatologist!

    For details, check our GitHub repo!

    Context

    The recent monkeypox outbreak has become a global healthcare concern owing to its rapid spread in more than 65 countries around the globe. To obstruct its expeditious pace, early diagnosis is a must. But the confirmatory Polymerase Chain Reaction (PCR) tests and other biochemical assays are not readily available in sufficient quantities. In this scenario, computer-aided monkeypox identification from skin lesion images can be a beneficial measure. Nevertheless, so far, such datasets are not available. Hence, the "Monkeypox Skin Lesion Dataset (MSLD)" is created by collecting and processing images from different means of web-scraping, i.e., from news portals, websites and publicly accessible case reports.

    The creation of "Monkeypox Image Lesion Dataset" is primarily focused on distinguishing the monkeypox cases from the similar non-monkeypox cases. Therefore, along with the 'Monkeypox' class, we included skin lesion images of 'Chickenpox' and 'Measles' because of their resemblance to the monkeypox rash and pustules in initial state in another class named 'Others' to perform binary classification.

    Content

    There are 3 folders in the dataset.

    1) Original Images: It contains a total number of 228 images, among which 102 belongs to the 'Monkeypox' class and the remaining 126 represents the 'Others' class i.e., non-monkeypox (chickenpox and measles) cases.

    2) Augmented Images: To aid the classification task, several data augmentation methods such as rotation, translation, reflection, shear, hue, saturation, contrast and brightness jitter, noise, scaling etc. have been applied using MATLAB R2020a. Although this can be readily done using ImageGenerator/other image augmentors, to ensure reproducibility of the results, the augmented images are provided in this folder. Post-augmentation, the number of images increased approximately 14-fold. The classes 'Monkeypox' and 'Others' have 1428 and 1764 images, respectively.

    3) Fold1: One of the three-fold cross validation datasets. To avoid any sort of bias in training, three-fold cross validation was performed. The original images were split into training, validation and test set(s) with the approximate proportion of 70 : 10 : 20 while maintaining patient independence. According to the commonly perceived data preparation practice, only the training and validation images were augmented while the test set contained only the original images. Users have the option of using the folds directly or using the original data and employing other algorithms to augment it.

    Additionally, a CSV file is provided that has 228 rows and two columns. The table contains the list of all the ImageID(s) with their corresponding label.
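
    Purely as an illustration, the following sketch performs a label-stratified 70:10:20 split of the provided image list, assuming scikit-learn and CSV columns named "ImageID" and "Label" (check the file for the actual names). Note that the authors' released folds additionally maintain patient independence, which this simple split does not reproduce.

    ```python
    # Illustrative 70:10:20 stratified split of the image list; file and column
    # names are hypothetical placeholders for the CSV described above.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    labels = pd.read_csv("image_labels.csv")  # hypothetical file name

    train_val, test = train_test_split(
        labels, test_size=0.20, stratify=labels["Label"], random_state=0)
    train, val = train_test_split(
        train_val, test_size=0.125, stratify=train_val["Label"], random_state=0)  # 0.125 * 0.8 = 0.1

    print(len(train), len(val), len(test))
    ```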

    Web Application

    Since monkeypox is demonstrating a very rapid community transmission pattern, consumer-level software is truly necessary to increase awareness and encourage people to take rapid action. We have developed an easy-to-use web application named Monkey Pox Detector using the open-source Python Streamlit framework that uses our trained model to address this issue. It makes predictions on whether or not to see a specialist along with the prediction accuracy. Future updates will benefit from the user data we continue to collect and use to improve our model. The web app has a Flask core, so that it can be deployed cross-platform in the future.

    Learn more at our GitHub repo!

    Citation

    If this dataset helped your research, please cite the following articles:

    Ali, S. N., Ahmed, M. T., Paul, J., Jahan, T., Sani, S. M. Sakeef, Noor, N., & Hasan, T. (2022). Monkeypox Skin Lesion Detection Using Deep Learning Models: A Preliminary Feasibility Study. arXiv preprint arXiv:2207.03342.

    @article{Nafisa2022, title={Monkeypox Skin Lesion Detection Using Deep Learning Models: A Preliminary Feasibility Study}, author={Ali, Shams Nafisa and Ahmed, Md. Tazuddin and Paul, Joydip and Jahan, Tasnim and Sani, S. M. Sakeef and Noor, Nawshaba and Hasan, Taufiq}, journal={arXiv preprint arXiv:2207.03342}, year={2022} }

    Ali, S. N., Ahmed, M. T., Jahan, T., Paul, J., Sani, S. M. Sakeef, Noor, N., Asma, A. N., & Hasan, T. (2023). A Web-based Mpox Skin Lesion Detection System Using State-of-the-art Deep Learning Models Considering Racial Diversity. arXiv preprint arXiv:2306.14169.

    @article{Nafisa2023, title={A Web-base...

  18. cardiotox

    • tensorflow.org
    Updated Dec 1, 2021
    Cite
    (2021). cardiotox [Dataset]. https://www.tensorflow.org/datasets/catalog/cardiotox
    Explore at:
    Dataset updated
    Dec 1, 2021
    Description

    The Drug Cardiotoxicity dataset [1-2] is a molecule classification task to detect cardiotoxicity caused by binding the hERG target, a protein associated with heartbeat rhythm. The data covers over 9000 molecules with hERG activity.

    Note:

    1. The data is split into four splits: train, test-iid, test-ood1, test-ood2.

    2. Each molecule in the dataset has 2D graph annotations which are designed to facilitate graph neural network modeling. Nodes are the atoms of the molecule and edges are the bonds. Each atom is represented as a vector encoding basic atom information such as atom type. Similar logic applies to bonds.

    3. We include Tanimoto fingerprint distance (to training data) for each molecule in the test sets to facilitate research on distributional shift in graph domain.

    For each example, the features include:
    • atoms: a 2D tensor with shape (60, 27) storing node features. Molecules with fewer than 60 atoms are padded with zeros. Each atom has 27 atom features.
    • pairs: a 3D tensor with shape (60, 60, 12) storing edge features. Each edge has 12 edge features.
    • atom_mask: a 1D tensor with shape (60, ) storing node masks. 1 indicates the corresponding atom is real, otherwise it is a padded one.
    • pair_mask: a 2D tensor with shape (60, 60) storing edge masks. 1 indicates the corresponding edge is real, otherwise it is a padded one.
    • active: a one-hot vector indicating if the molecule is toxic or not. [0, 1] indicates it is toxic, otherwise [1, 0] non-toxic.

    References

    [1]: V. B. Siramshetty et al. Critical Assessment of Artificial Intelligence Methods for Prediction of hERG Channel Inhibition in the Big Data Era. JCIM, 2020. https://pubs.acs.org/doi/10.1021/acs.jcim.0c00884

    [2]: K. Han et al. Reliable Graph Neural Networks for Drug Discovery Under Distributional Shift. NeurIPS DistShift Workshop 2021. https://arxiv.org/abs/2111.12951

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('cardiotox', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  19. Data for binary classification experiments - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Aug 16, 2025
    Cite
    (2025). Data for binary classification experiments - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/d06f6716-2ae6-56d5-abbe-e8526da23582
    Explore at:
    Dataset updated
    Aug 16, 2025
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Predicting spatial familiarity by exploiting head and eye movements during pedestrian navigation in the real world. This paper will be published in Springer Nature Scientific Reports.

    File overview

    The structure of the archive is the following:

    Folder "01_data" contains all the data files needed and a readme file describing the structure of each of these data files. These data files are:
    • lsp.csv [contains demographic data about participants]
    • matched_gaze_imu.csv [contains the segmented behavioral data, i.e. both gaze features and imu features]
    • matched_gaze_imu_feature_description.pdf [contains a description of the features contained in matched_gaze_imu.csv]
    • walking_dates.csv [contains an overview of the dates on which participants walked the familiar and unfamiliar routes]
    • users_polygons.csv [contains one or more polygons per participant in which they are familiar]
    • polygons_markers.csv [contains locations of POIs per polygon that participants reported being familiar with]
    • user_routes.csv [contains the route participants provided between a randomly selected pair of POIs they have provided for a given polygon]

    Folder "02_scripts" contains the data analysis scripts; they are organized in two subfolders:
    • 01_ml_scripts: the scripts for the XGBoost classification, organized as two Python files in which further instructions for use are given.
      • 80_20_code.py is the Python file which runs the ML experiments using an 80/20 train/test split.
      • L5O4T_code.py is the Python file which runs the ML experiments leaving the full data of five different participants per condition as unseen data for the test.
      • requirements.txt states the used Python package versions.
    • 02_r_scripts: cleaned_script.Rmd is an R notebook which can easily be opened in R-Studio and provides the analysis scripts for the descriptive statistics presented in the paper.
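
    For orientation, a minimal sketch of an 80/20 XGBoost experiment of the kind run by 80_20_code.py. This is not the authors' code: the label column name and the file path are assumptions; see matched_gaze_imu.csv and its feature description for the actual schema.

    ```python
    # Illustrative 80/20 XGBoost classification; column names and paths are assumed.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from xgboost import XGBClassifier

    data = pd.read_csv("01_data/matched_gaze_imu.csv")
    y = data["familiar"]               # hypothetical binary label column
    X = data.drop(columns=["familiar"])

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=0)

    clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
    clf.fit(X_train, y_train)
    print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
    ```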

  20. t

    Tour Recommendation Model

    • test.researchdata.tuwien.at
    • test.researchdata.tuwien.ac.at
    bin, png +1
    Updated May 14, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Muhammad Mobeel Akbar (2025). Tour Recommendation Model [Dataset]. http://doi.org/10.70124/akpf6-8p175
    Explore at:
    text/markdown, png, bin
    Available download formats
    Dataset updated
    May 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Muhammad Mobeel Akbar
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Dataset Description for Tour Recommendation Model

    Context and Methodology:

    • Research Domain/Project:
      This dataset is part of the Tour Recommendation System project, which focuses on predicting user preferences and ratings for various tourist places and events. It belongs to the field of Machine Learning, specifically applied to Recommender Systems and Predictive Analytics.

    • Purpose:
      The dataset serves as the training and evaluation data for a Decision Tree Regressor model, which predicts ratings (from 1-5) for different tourist destinations based on user preferences. The model can be used to recommend places or events to users based on their predicted ratings.

    • Creation Methodology:
      The dataset was originally collected from a tourism platform where users rated various tourist places and events. The data was preprocessed to remove missing or invalid entries (such as #NAME? in rating columns). It was then split into subsets for training, validation, and testing the model.

    Technical Details:

    • Structure of the Dataset:
      The dataset is stored as a CSV file (user_ratings_dataset.csv) and contains the following columns:

      • place_or_event_id: Unique identifier for each tourist place or event.

      • rating: Rating given by the user, ranging from 1 to 5.

      The data is split into three subsets:

      • Training Set: 80% of the dataset used to train the model.

      • Validation Set: A small portion used for hyperparameter tuning.

      • Test Set: 20% used to evaluate model performance.

    • Folder and File Naming Conventions:
      The dataset files are stored in the following structure:

      • user_ratings_dataset.csv: The original dataset file containing user ratings.

      • tour_recommendation_model.pkl: The saved model after training.

      • actual_vs_predicted_chart.png: A chart comparing actual and predicted ratings.

    • Software Requirements:
      To open and work with this dataset, the following software and libraries are required:

      • Python 3.x

      • Pandas for data manipulation

      • Scikit-learn for training and evaluating machine learning models

      • Matplotlib for chart generation

      • Joblib for saving and loading the trained model

      The dataset can be opened and processed using any Python environment that supports these libraries.

    • Additional Resources:

      • The model training code, README file, and performance chart are available in the project repository.

      • For a detailed explanation and the code, please refer to the GitHub repository (or any other relevant link for the code); a minimal training sketch is also shown below.
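
    The following is a minimal sketch of the described workflow, using the libraries listed under Software Requirements; the hyperparameters and the numeric treatment of place_or_event_id are assumptions for illustration, not the settings used to produce tour_recommendation_model.pkl:

    import joblib
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    # Load the ratings and split 80/20 as described in the dataset structure.
    df = pd.read_csv("user_ratings_dataset.csv")
    X = df[["place_or_event_id"]]  # assumed numeric; encode it first if it is categorical
    y = df["rating"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # With only an identifier as input, the tree effectively learns per-place average ratings.
    model = DecisionTreeRegressor(max_depth=5, random_state=42)
    model.fit(X_train, y_train)
    print("R^2 on the test set:", model.score(X_test, y_test))

    # Persist the trained model, mirroring tour_recommendation_model.pkl.
    joblib.dump(model, "tour_recommendation_model.pkl")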

    Further Details:

    • Dataset Reusability:
      The dataset is structured for easy use in training machine learning models for recommendation systems. Researchers and practitioners can utilize it to:

      • Train other types of models (e.g., regression, classification).

      • Experiment with different features or add more metadata to enrich the dataset.

    • Data Integrity:
      The dataset has been cleaned and preprocessed to remove invalid values (such as #NAME? or missing ratings). However, users should ensure they understand the structure and the preprocessing steps taken before reusing it.

    • Licensing:
      The dataset is provided under the CC BY 4.0 license, which allows free usage, distribution, and modification, provided that proper attribution is given.
