42 datasets found
  1. Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO (2023). NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python [Dataset]. http://doi.org/10.6084/m9.figshare.21967265.v1
    Explore at:
    txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts at filtering those projects to curate ML projects of high quality. The limited availability of such a high-quality dataset poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of the project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.

    GitHub page: https://github.com/soarsmu/NICHE
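
    Since NICHE.csv is a plain CSV file, it can be inspected directly with pandas. A minimal sketch (the column names are assumptions; check the repository README for the actual schema):

    import pandas as pd

    df = pd.read_csv("NICHE.csv")
    print(df.shape)                # expect 572 rows, one per project
    print(df.columns.tolist())     # inspect the actual column names first
    # Assuming a label column exists, its distribution should show
    # 441 engineered vs. 131 non-engineered projects, e.g.:
    # df["label"].value_counts()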

  2. UCI and OpenML Data Sets for Ordinal Quantification

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jul 25, 2023
    Cite
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
    Explore at:
    zip
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
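
    For instance, the validation samples of the APP protocol could be reconstructed roughly as follows. This is a sketch: the extracted data file name and the absence of a header row in the index files are assumptions.

    import pandas as pd

    data = pd.read_csv("extracted_data.csv")                    # hypothetical name of one extracted CSV
    indices = pd.read_csv("app_val_indices.csv", header=None)
    sample_0 = data.iloc[indices.iloc[0].dropna().astype(int)]  # data items of the first validation sample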

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You need a working Julia installation. We used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  3. OSM buildings noisy labels dataset

    • explore.openaire.eu
    Updated Jun 26, 2021
    Cite
    Jonas Gütter (2021). OSM buildings noisy labels dataset [Dataset]. http://doi.org/10.5281/zenodo.4446736
    Explore at:
    Dataset updated
    Jun 26, 2021
    Authors
    Jonas Gütter
    Description

    This dataset contains tile imagery from the OpenStreetMap project alongside label masks for buildings from OpenStreetMap. Besides the original clean label set, additional noisy label sets with random noise, removed buildings, and added buildings are provided. The purpose of this dataset is to provide training data for analysing the impact of noisy labels on the performance of models for semantic segmentation in Earth observation. The code for downloading and creating the datasets, as well as for performing some preliminary analyses, is also provided; however, it is necessary to have access to a tile server where OpenStreetMap tiles can be downloaded in sufficient amounts. To reproduce the dataset and perform analysis on it, do the following:

    1. Unzip data.zip and code.zip and create the folder structure from data.
    2. Build and activate a python environment from environment.yml.
    3. Insert the url of a suitable tile server for OSM tiles in line 76 of utils.py.
    4. Execute download_OSM_dataset.py to download OSM image tiles alongside OSM labels.
    5. Execute create_noisy_labels.py for the OSM dataset to create noisy label sets.
    6. Divide the images and labels into train and test data. split_data.py can be used as a baseline for this, but pathnames have to be adjusted and the corresponding directories have to be created first.
    7. Call train_model.py to train a model on the data. Specify the data size and the label set by giving command line arguments as shown in train_model.sh.

  4. ML Classifiers, Human-Tagged Datasets, and Validation Code for Structured LLM-Generated Event Messaging: BERT, Keras, XGBoost, and Ensemble Methods

    • data.mendeley.com
    Updated Aug 15, 2025
    Cite
    Christopher Lynch (2025). ML Classifiers, Human-Tagged Datasets, and Validation Code for Structured LLM-Generated Event Messaging: BERT, Keras, XGBoost, and Ensemble Methods [Dataset]. http://doi.org/10.17632/g2sdzmssgh.1
    Explore at:
    Dataset updated
    Aug 15, 2025
    Authors
    Christopher Lynch
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset and code package supports the reproducible evaluation of Structured Large Language Model (LLM)-generated event messaging using multiple machine learning classifiers, including BERT (via TensorFlow/Keras), XGBoost, and ensemble methods. It includes:

    • Tagged datasets (.csv): human-tagged gold labels for evaluation
    • Untagged datasets (.csv): raw data with Prompt matched to corresponding LLM-generated narrative
      • Suitable for inference, semi-automatic labeling, or transfer learning
    • Python and R code for preprocessing, model training, evaluation, and visualization
    • Configuration files and environment specifications to enable end-to-end reproducibility

    The materials accompany the study presented in [Lynch, Christopher, Erik Jensen, Ross Gore, et al. "AI-Generated Messaging for Life Events Using Structured Prompts: A Comparative Study of GPT With Human Experts and Machine Learning." TechRxiv (2025), DOI: https://doi.org/10.36227/techrxiv.174123588.85605769/v1], where Structured Narrative Prompting was applied to generate life-event messages from LLMs, followed by human annotation, and machine learning validation. This release provides complete transparency for reproducing reported metrics and facilitates further benchmarking in multilingual or domain-specific contexts.

    Value of the Data:
    • Enables direct replication of published results across BERT, Keras-based models, XGBoost, and ensemble classifiers.
    • Provides clean, human-tagged datasets suitable for training, evaluation, and bias analysis.
    • Offers untagged datasets for new annotation or domain adaptation.
    • Contains full preprocessing, training, and visualization code in Python and R for flexibility across workflows.
    • Facilitates extension into other domains (e.g., multilingual LLM messaging validation).

    Data Description:
    • /data/tagged/*.csv – Human-labeled datasets with schema defined in data_dictionary.csv.
    • /data/untagged/*.csv – Clean datasets without labels for inference or annotation.
    • /code/python/ – Python scripts for preprocessing, model training (BERT, Keras DNN, XGBoost), ensembling, evaluation metrics, and plotting.
    • /code/r/ – R scripts for exploratory data analysis, statistical testing, and replication of key figures/tables.

    File Formats:
    • Data: CSV (UTF-8, RFC 4180)
    • Code: .py, .R, .Rproj

    Ethics & Licensing * All data are de-identified and contain no PII. * Released under CC BY 4.0 (data) and MIT License (code).

    Limitations:
    • Labels reflect annotator interpretations and may encode bias.
    • Models trained on English text; generalization to other languages requires adaptation.

    Funding Note: Funding sources provided time in support of human taggers annotating the data sets.

  5. SRL4ORL: Improving Opinion Role Labeling Using Multi-Task Learning With Semantic Role Labeling

    • heidata.uni-heidelberg.de
    zip
    Updated Feb 4, 2019
    Cite
    Ana Marasovic (2019). SRL4ORL: Improving Opinion Role Labeling Using Multi-Task Learning With Semantic Role Labeling [Source Code] [Dataset]. http://doi.org/10.11588/DATA/LWN9XE
    Explore at:
    zip (14676065)
    Dataset updated
    Feb 4, 2019
    Dataset provided by
    heiDATA
    Authors
    Ana Marasovic
    License

    https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.11588/DATA/LWN9XE

    Description

    This repository contains code for reproducing experiments done in Marasovic and Frank (2018). Paper abstract: For over a decade, machine learning has been used to extract opinion-holder-target structures from text to answer the question "Who expressed what kind of sentiment towards what?". Recent neural approaches do not outperform the state-of-the-art feature-based models for Opinion Role Labeling (ORL). We suspect this is due to the scarcity of labeled training data and address this issue using different multi-task learning (MTL) techniques with a related task which has substantially more data, i.e. Semantic Role Labeling (SRL). We show that two MTL models improve significantly over the single-task model for labeling of both holders and targets, on the development and the test sets. We found that the vanilla MTL model, which makes predictions using only shared ORL and SRL features, performs the best. With deeper analysis, we determine what works and what might be done to make further improvements for ORL.

    Data for ORL: Download the MPQA 2.0 corpus. Check mpqa2-pytools for example usage. Splits can be found in the datasplit folder.

    Data for SRL: The data is provided by the CoNLL-2005 Shared Task, but the original words are from the Penn Treebank dataset, which is not publicly available.

    How to train models?

    python main.py --adv_coef 0.0 --model fs --exp_setup_id new --n_layers_orl 0 --begin_fold 0 --end_fold 4
    python main.py --adv_coef 0.0 --model html --exp_setup_id new --n_layers_orl 1 --n_layers_shared 2 --begin_fold 0 --end_fold 4
    python main.py --adv_coef 0.0 --model sp --exp_setup_id new --n_layers_orl 3 --begin_fold 0 --end_fold 4
    python main.py --adv_coef 0.1 --model asp --exp_setup_id prior --n_layers_orl 3 --begin_fold 0 --end_fold 10

  6. RAD-ChestCT Dataset

    • zenodo.org
    Updated Apr 4, 2023
    Cite
    Rachel Lea Draelos; David Dov; Maciej A Mazurowski; Joseph Y. Lo; Ricardo Henao; Geoffrey D. Rubin; Lawrence Carin (2023). RAD-ChestCT Dataset [Dataset]. http://doi.org/10.5281/zenodo.6406114
    Explore at:
    Dataset updated
    Apr 4, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rachel Lea Draelos; David Dov; Maciej A Mazurowski; Joseph Y. Lo; Ricardo Henao; Geoffrey D. Rubin; Lawrence Carin
    Description

    Overview

    The RAD-ChestCT dataset is a large medical imaging dataset developed by Duke MD/PhD student Rachel Draelos during her Computer Science PhD supervised by Lawrence Carin. The full dataset includes 35,747 chest CT scans from 19,661 adult patients. This Zenodo repository contains an initial release of 3,630 chest CT scans, approximately 10% of the dataset. This dataset is of significant interest to the machine learning and medical imaging research communities.

    Papers

    The following published paper includes a description of how the RAD-ChestCT dataset was created: Draelos et al., "Machine-Learning-Based Multiple Abnormality Prediction with Large-Scale Chest Computed Tomography Volumes," Medical Image Analysis 2021. DOI: 10.1016/j.media.2020.101857 https://pubmed.ncbi.nlm.nih.gov/33129142/

    Two additional papers leveraging the RAD-ChestCT dataset are available as preprints:

    "Use HiResCAM instead of Grad-CAM for faithful explanations of convolutional neural networks" (https://arxiv.org/abs/2011.08891)

    "Explainable multiple abnormality classification of chest CT volumes with deep learning" (https://arxiv.org/abs/2111.12215)

    Details about the files included in this data release

    Metadata Files (4)

    CT_Scan_Metadata_Complete_35747.csv: includes metadata about the whole dataset, with information extracted from DICOM headers.

    Extrema_5747.csv: includes coordinates for lung bounding boxes for the whole dataset. Coordinates were derived computationally using a morphological image processing lung segmentation pipeline.

    Indications_35747.csv: includes scan indications for the whole dataset. Indications were extracted from the free-text reports.

    Summary_3630.csv: includes a listing of the 3,630 scans that are part of this repository.

    Label Files (3)

    The label files contain abnormality x location labels for the 3,630 shared CT volumes. Each CT volume is annotated with a matrix of 84 abnormality labels x 52 location labels. Labels were extracted from the free text reports using the Sentence Analysis for Radiology Label Extraction (SARLE) framework. For each CT scan, the label matrix has been flattened and the abnormalities and locations are separated by an asterisk in the CSV column headers (e.g. "mass*liver"). The labels can be used as the ground truth when training computer vision classifiers on the CT volumes. Label files include: imgtrain_Abnormality_and_Location_Labels.csv (for the training set)

    imgvalid_Abnormality_and_Location_Labels.csv (for the validation set)

    imgtest_Abnormality_and_Location_Labels.csv (for the test set)

    CT Volume Files (3,630)

    Each CT scan is provided as a compressed 3D numpy array (npz format). The CT scans can be read using the Python package numpy, version 1.14.5 and above.
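
    A minimal sketch for loading one volume and its labels with numpy and pandas (the scan filename, the npz array key, and the CSV index column are assumptions):

    import numpy as np
    import pandas as pd

    with np.load("trn00001.npz") as archive:     # hypothetical scan filename
        volume = archive[archive.files[0]]       # the stored 3D CT array
    print(volume.shape, volume.dtype)

    labels = pd.read_csv("imgtrain_Abnormality_and_Location_Labels.csv", index_col=0)
    print(labels["mass*liver"].head())           # one flattened abnormality*location column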

    Related Code

    Code related to RAD-ChestCT is publicly available on GitHub at https://github.com/rachellea.

    Repositories of interest include:

    https://github.com/rachellea/ct-net-models contains PyTorch code to load the RAD-ChestCT dataset and train convolutional neural network models for multiple abnormality prediction from whole CT volumes.

    https://github.com/rachellea/ct-volume-preprocessing contains an end-to-end Python framework to convert CT scans from DICOM to numpy format. This code was used to prepare the RAD-ChestCT volumes.

    https://github.com/rachellea/sarle-labeler contains the Python implementation of the SARLE label extraction framework used to generate the abnormality and location label matrix from the free text reports. SARLE has minimal dependencies and the abnormality and location vocabulary terms can be easily modified to adapt SARLE to different radiologic modalities, abnormalities, and anatomical locations.

  7. Active Evaluation Software for Selection of Ground Truth Labels

    • catalog.data.gov
    • s.cnmilf.com
    Updated Jul 29, 2022
    Cite
    National Institute of Standards and Technology (2022). Active Evaluation Software for Selection of Ground Truth Labels [Dataset]. https://catalog.data.gov/dataset/active-evaluation-software-for-selection-of-ground-truth-labels-d0581
    Explore at:
    Dataset updated
    Jul 29, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This software repository contains a Python package, Aegis (Active Evaluator Germane Interactive Selector), that allows us to evaluate a machine learning system's performance (according to a metric such as accuracy) by adaptively sampling trials to label from an unlabeled test set, minimizing the number of labels needed. It includes sample (public) data as well as a simulation script that tests different label-selecting strategies on already-labelled test sets. The software is configured so that users can add their own data and system outputs to test evaluation.

  8. Machine learning models on the WDBC dataset

    • scidb.cn
    Updated Apr 15, 2025
    Cite
    Mahdi Aghaziarati (2025). machine learning models on the WDBC dataset [Dataset]. http://doi.org/10.57760/sciencedb.23537
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Mahdi Aghaziarati
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors—mean, standard error (SE), and worst (mean of the three largest values)—for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension. All feature values were normalized using z-score standardization to ensure uniform scale across models sensitive to input ranges. No missing values were present in the original dataset.

    Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling.

    The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers—Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)—were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability.

    All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model. Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically relevant dataset.
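
    A minimal sketch of the described pipeline with scikit-learn; the CSV filename and column names follow the common Kaggle WDBC layout and are assumptions:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import Perceptron
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("wdbc.csv")                    # hypothetical filename
    y = (df["diagnosis"] == "M").astype(int)        # 1 = malignant, 0 = benign
    X = StandardScaler().fit_transform(df.drop(columns=["id", "diagnosis"]))  # z-score standardization
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
    for model in (DecisionTreeClassifier(), GaussianNB(), Perceptron(), KNeighborsClassifier()):
        acc = accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
        print(type(model).__name__, round(acc, 3))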

  9. Data from: Data Arrays for Microearthquake (MEQ) Monitoring using Deep Learning for the Newberry EGS Sites

    • catalog.data.gov
    • gdr.openei.org
    Updated Jan 20, 2025
    Cite
    Pennsylvania State University (2025). Data Arrays for Microearthquake (MEQ) Monitoring using Deep Learning for the Newberry EGS Sites [Dataset]. https://catalog.data.gov/dataset/data-arrays-for-microearthquake-meq-monitoring-using-deep-learning-for-the-newberry-egs-si-a2c4d
    Explore at:
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    Pennsylvania State University
    Description

    The 'Machine Learning Approaches to Predicting Induced Seismicity and Imaging Geothermal Reservoir Properties' project looks to apply machine learning (ML) methods to microearthquake (MEQ) data for imaging geothermal reservoir properties and forecasting seismic events, in order to advance geothermal exploration and safe geothermal energy production. As part of the project, this submission provides data arrays for 149 microearthquakes recorded between 2012 and 2013 at the Newberry EGS Site, for use with the deep learning algorithm that has been developed. The data provided includes raw waveform data, location data, normalized waveform data, and processed waveform data. The Penn State Geothermal Team has shared the following files from the project:

    - Normalized Waveform Inputs.npz – normalized waveforms of the 149 microearthquakes (MEQs) recorded between 2012 and 2013 at the Newberry EGS sites
    - Processed Waveform Inputs.npz – labels of the 149 MEQs
    - Location Data.npz – location labels of the 149 MEQs

    Note: .npz is the Python file format from NumPy that provides storage of array data.
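
    Since the arrays ship as NumPy .npz archives, their contents can be inspected directly. The array key names are not documented above, so this sketch simply lists them:

    import numpy as np

    archive = np.load("Normalized Waveform Inputs.npz")
    print(archive.files)                   # names of the stored arrays
    waveforms = archive[archive.files[0]]  # e.g. the normalized waveform array
    print(waveforms.shape)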

  10. Human Inner Ear Anatomy: Labeled Volume CT Data of Inner Ear Fluid Space and Anatomical Landmarks

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 30, 2023
    Cite
    Blaimer, Martin (2023). Human Inner Ear Anatomy: Labeled Volume CT Data of Inner Ear Fluid Space and Anatomical Landmarks [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8277158
    Explore at:
    Dataset updated
    Aug 30, 2023
    Dataset provided by
    Blaimer, Martin
    Neun, Tilman
    Pelt, Daniël M.
    Stebani, Jannik
    Rak, Kristen
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The provided dataset comprises 43 instances of temporal bone volume CT scans. The scans were performed on human cadaveric specimens with a resulting isotropic voxel size of 99 × 99 × 99 μm³. Voxel-wise image labels of the fluid space of the bony labyrinth, subdivided into the three semantic classes cochlear volume, vestibular volume and semicircular canal volume, are provided. In addition, each dataset contains JSON-like descriptor data defining the voxel coordinates of the anatomical landmarks: (1) apex of the cochlea, (2) oval window and (3) round window. The dataset can be used to train and evaluate machine learning models for automated inner ear analysis in the context of the supervised learning paradigm.

    Usage Notes

    The datasets are formatted in the HDF5 format developed by The HDF Group. We utilized, and thus recommend, the Python bindings h5py to handle the datasets.

    The flat-panel volume CT raw data, labels and landmarks are saved in the HDF5-internal file structure using the respective group and datasets:

    raw/raw-0
    label/label-0
    landmark/landmark-0
    landmark/landmark-1
    landmark/landmark-2

    Array raw and label data can be read from the file by indexing into an opened h5py file handle, for example as numpy.ndarray. Further metadata is contained in the attribute dictionaries of the raw and label datasets.

    Landmark coordinate data is available as an attribute dict and contains the coordinate system (LPS or RAS), IJK voxel coordinates and label information. The helicotrema (cochlea top) is saved globally in landmark 0, the oval window in landmark 1 and the round window in landmark 2. Read as a Python dictionary, exemplary landmark information for a dataset may read as follows:

    {'coordsys': 'LPS', 'id': 1, 'ijk_position': array([181, 188, 100]), 'label': 'CochleaTop', 'orientation': array([-1., -0., -0., -0., -1., -0., 0., 0., 1.]), 'xyz_position': array([ 44.21109689, -139.38058589, -183.48249736])}

    {'coordsys': 'LPS', 'id': 2, 'ijk_position': array([222, 182, 145]), 'label': 'OvalWindow', 'orientation': array([-1., -0., -0., -0., -1., -0., 0., 0., 1.]), 'xyz_position': array([ 48.27890112, -139.95991131, -179.04103763])}

    {'coordsys': 'LPS', 'id': 3, 'ijk_position': array([223, 209, 147]), 'label': 'RoundWindow', 'orientation': array([-1., -0., -0., -0., -1., -0., 0., 0., 1.]), 'xyz_position': array([ 48.33120126, -137.27135678, -178.8665465 ])}
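
    A minimal h5py sketch following this layout (the filename is a placeholder):

    import h5py

    with h5py.File("specimen_00.h5", "r") as f:          # placeholder filename
        raw = f["raw/raw-0"][...]                        # CT volume as a numpy array
        label = f["label/label-0"][...]                  # voxel-wise semantic labels
        landmark = dict(f["landmark/landmark-0"].attrs)  # attribute dict as shown above
    print(raw.shape, label.shape, landmark["label"])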

  11. A new remote sensing benchmark dataset for machine learning applications: MultiSenGE

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Dec 16, 2024
    Cite
    Romain Wenger (2024). A new remote sensing benchmark dataset for machine learning applications : MultiSenGE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6375465
    Explore at:
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    Jonathan Weber
    Germain Forestier
    Lhassane Idoumghar
    Romain Wenger
    Anne Puissant
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    [UPDATE] You can now access the MultiSen (GE and NA) collection through this portal: https://doi.theia.data-terra.org/ai4lcc/?lang=en

    MultiSenGE is a new large-scale multimodal and multitemporal benchmark dataset covering one of the biggest administrative regions in the eastern part of France. It contains 8,157 patches of 256 × 256 pixels for Sentinel-2 L2A, Sentinel-1 GRD and a regional LULC topographic database.

    Every file follows a specific nomenclature:

    Sentinel-1 patches: {tile}_{date}_S1_{x-pixel-coordinate}_{y-pixel-coordinate}.tif

    Sentinel-2 patches: {tile}_{date}_S2_{x-pixel-coordinate}_{y-pixel-coordinate}.tif

    Ground reference patches: {tile}_GR_{x-pixel-coordinate}_{y-pixel-coordinate}.tif

    JSON Labels: {tile}_{x-pixel-coordinate}_{y-pixel-coordinate}.json

    where tile is the Sentinel-2 tile number, date the date of acquisition of the patch, x-pixel-coordinate and y-pixel-coordinate are the coordinates of the patch in the tile.
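
    A small sketch that recovers these fields from a Sentinel-2 patch filename; the example name and the exact token formats are assumptions:

    import re

    S2_PATTERN = re.compile(r"(?P<tile>[^_]+)_(?P<date>[^_]+)_S2_(?P<x>\d+)_(?P<y>\d+)\.tif$")

    m = S2_PATTERN.match("T32ULU_20200815_S2_0_2560.tif")  # hypothetical filename
    if m:
        print(m.group("tile"), m.group("date"), m.group("x"), m.group("y"))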

    In addition, you can find a set of useful python tools for extracting information about the dataset on Github : https://github.com/r-wenger/MultiSenGE-Tools

    First experiments based on this dataset are in press in ISPRS Annals: Wenger, R., Puissant, A., Weber, J., Idoumghar, L., and Forestier, G.: MultiSenGE: A Multimodal and Multitemporal Benchmark Dataset for Land Use/Land Cover Remote Sensing Applications, ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci., V-3-2022, 635–640, https://doi.org/10.5194/isprs-annals-V-3-2022-635-2022, 2022.

    Due to the large size of the dataset, you will only find the associated JSON files on this Zenodo repository. To download the Sentinel-1, Sentinel-2 patches and the reference data, please do so via these links:

    Sentinel-1 temporal series patches: https://s3.unistra.fr/a2s_datasets/MultiSenGE/s1.tgz

    Sentinel-2 temporal series patches: https://s3.unistra.fr/a2s_datasets/MultiSenGE/s2.tgz

    Ground reference patches: https://s3.unistra.fr/a2s_datasets/MultiSenGE/ground_reference.tgz

    JSON files for each patch: https://s3.unistra.fr/a2s_datasets/MultiSenGE/labels.tgz

  12. Industrial Machine Tool Element Surface Defect Dataset - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Jun 2, 2024
    Cite
    (2024). Industrial Machine Tool Element Surface Defect Dataset - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/63d20a5e-3584-5096-a34d-d3f93fcc8857
    Explore at:
    Dataset updated
    Jun 2, 2024
    Description

    Using Machine Learning techniques in general, and Deep Learning techniques in particular, requires a certain amount of data that is often not available in large quantities in some technical domains. The manual inspection of machine tool components, as well as the manual end-of-line check of products, are labour-intensive tasks in industrial applications that companies often want to automate. To automate the classification processes and to develop reliable and robust Machine Learning based classification and wear prognostics models, there is a need for real-world datasets to train and test the models on.

    The dataset contains 1104 three-channel images with 394 image annotations for the surface damage type "pitting". The annotations, made with the annotation tool labelme, are available in JSON format and hence convertible to VOC and COCO format. All images come from two BSD types. The dataset available for download is divided into three folders: data, with all images as JPEG; label, with all annotations; and saved_model, with a baseline model. The authors also provide a python script to divide the data and labels into three different split types:

    - train_test_split, which splits images into the same train and test data split the authors used for the baseline model
    - wear_dev_split, which creates all 27 wear developments
    - type_split, which splits the data into the occurring BSD types

    One of the two mentioned BSD types is represented with 69 images and 55 different image sizes. All images with this BSD type come either in a clean or a soiled condition. The other BSD type is shown on 325 images with two image sizes. Since all images of this type have been taken in continuous time, the degree of soiling evolves over the sequence. The dataset also contains, as mentioned above, 27 pitting development sequences of 69 images each.

    Instruction dataset split

    The authors of this dataset provide 3 different types of dataset splits. To get a data split, you have to run the python script split_dataset.py.

    Script inputs:
    - split-type (mandatory)
    - output directory (mandatory)

    Different split-types:
    - train_test_split: splits the dataset into train and test data (80%/20%)
    - wear_dev_split: splits the dataset into the 27 wear developments
    - type_split: splits the dataset into the different BSD types

    Example:

    C:\Users\Desktop>python split_dataset.py --split_type=train_test_split --output_dir=BSD_split_folder

  13. Face-Detection-Dataset

    • kaggle.com
    • gts.ai
    Updated Jun 10, 2023
    Cite
    Fares Elmenshawii (2023). Face-Detection-Dataset [Dataset]. https://www.kaggle.com/datasets/fareselmenshawii/face-detection-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Fares Elmenshawii
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset comprises 16.7k images and 2 annotation files, each in a distinct format. The first file, labeled "Label," contains annotations with the original scale, while the second file, named "yolo_format_labels," contains annotations in YOLO format. The dataset was obtained by employing the OIDv4 toolkit, specifically designed for scraping data from Google Open Images. Notably, this dataset exclusively focuses on face detection.

    This dataset offers a highly suitable resource for training deep learning models specifically designed for face detection tasks. The images within the dataset exhibit exceptional quality and have been meticulously annotated with bounding boxes encompassing the facial regions. The annotations are provided in two formats: the original scale, denoting the pixel coordinates of the bounding boxes, and the YOLO format, representing the bounding box coordinates in normalized form.
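
    The two annotation formats are related by a simple rescaling. A sketch of the conversion from a normalized YOLO box (x_center, y_center, width, height) back to pixel corner coordinates:

    def yolo_to_pixel_box(xc, yc, w, h, img_w, img_h):
        """Convert a normalized YOLO box to (x1, y1, x2, y2) pixel coordinates."""
        x1 = (xc - w / 2) * img_w
        y1 = (yc - h / 2) * img_h
        x2 = (xc + w / 2) * img_w
        y2 = (yc + h / 2) * img_h
        return x1, y1, x2, y2

    # Example: a centered box covering half the image in each dimension.
    print(yolo_to_pixel_box(0.5, 0.5, 0.5, 0.5, 640, 480))  # (160.0, 120.0, 480.0, 360.0)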

    The dataset was meticulously curated by scraping relevant images from Google Open Images through the use of the OIDv4 toolkit. Only images that are pertinent to face detection tasks have been included in this dataset. Consequently, it serves as an ideal choice for training deep learning models that specifically target face detection tasks.

  14. Data from: Label-free timing analysis of SiPM-based modularized detectors with physics-constrained deep learning

    • search.dataone.org
    • datadryad.org
    Updated Apr 27, 2024
    Cite
    Pengcheng Ai; Le Xiao; Zhi Deng; Yi Wang; Xiangming Sun; Guangming Huang; Dong Wang; Yulei Li; Xinchi Ran (2024). Label-free timing analysis of SiPM-based modularized detectors with physics-constrained deep learning [Dataset]. http://doi.org/10.5061/dryad.qv9s4mwkj
    Explore at:
    Dataset updated
    Apr 27, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Pengcheng Ai; Le Xiao; Zhi Deng; Yi Wang; Xiangming Sun; Guangming Huang; Dong Wang; Yulei Li; Xinchi Ran
    Time period covered
    Oct 25, 2023
    Description

    Pulse timing is an important topic in nuclear instrumentation, with far-reaching applications from high energy physics to radiation imaging. While high-speed analog-to-digital converters become more and more developed and accessible, their potential uses and merits in nuclear detector signal processing are still uncertain, partially due to associated timing algorithms which are not fully understood and utilized. In the paper "Label-free timing analysis of SiPM-based modularized detectors with physics-constrained deep learning", we propose a novel method based on deep learning for timing analysis of modularized detectors without explicit needs of labelling event data. By taking advantage of the intrinsic time correlations, a label-free loss function with a specially designed regularizer is formed to supervise the training of neural networks towards a meaningful and accurate mapping function. We mathematically demonstrate the existence of the optimal function desired by the method, and ... The program is tested with the following settings:

    python==3.9.5
    tf-nightly-gpu==2.7.0.dev20210730
    keras-nightly==2.7.0.dev2021073000
    tensorflow-model-optimization==0.6.0
    numpy==1.19.5
    scipy==1.6.2
    matplotlib==3.3.4
    pandas==1.3.0
    pyyaml==5.4.1

    Newer versions may also work.

    When data and software files are downloaded, please unzip to a shared folder so that the computer code will work properly.

    root directory: contains main routine scripts to train neural networks (NNs), and README.md (this file).

    ./s_toy_routine.py: Python script to train NNs on the toy experiment.
    ./s_basic_routine.py: Python script to train NNs on the ECAL experiment.
    ./README.md: This file.

    ./conf/ directory: configuration files for main routine scripts.

    laser_in2048_[cluster]_[frequency]_2ch_internal.yaml: Configuration files for the toy experiment. Use the optional [cluster] to select data; use an optional low-pass filter with [frequency] to preprocess data.
    ecal_[network]_in800_8ch_intern...

    Data from: Label-free timing analysis of SiPM-based modularized detectors with physics-constrained deep learning

    Introduction

    This repository holds the computer code and raw data to reproduce the results in the paper: Label-free timing analysis of SiPM-based modularized detectors with physics-constrained deep learning


  15. Network Traffic Data-Malicious Activity Detection

    • kaggle.com
    Updated Mar 18, 2024
    Cite
    Advait Nandakumar Menon (2024). Network Traffic Data-Malicious Activity Detection [Dataset]. https://www.kaggle.com/datasets/advaitnmenon/network-traffic-data-malicious-activity-detection/versions/1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Advait Nandakumar Menon
    Description

    Documentation for Network Traffic Dataset

    Dataset Overview

    This dataset consists of network traffic captured from a Kali Linux machine, aimed at helping the development and evaluation of machine learning models for distinguishing between normal and malicious (specifically flood attack) network activities. It includes a variety of features essential for identifying potential cybersecurity threats alongside labels indicating whether each packet is part of flood traffic.

    Data Collection Methodology

    The dataset was carefully compiled using network traffic captured from a dedicated Kali Linux setup. The capture environment consisted of a Kali Linux machine configured to generate and capture both normal and malicious network traffic and a target machine running a Windows OS to simulate a real-world network environment.

    Traffic Generation:

    Normal Traffic: Involved routine network activities such as web browsing and pinging between the Kali Linux machine and the Windows machine.

    Malicious Traffic: Utilized hping3 to simulate flood attacks, specifically ICMP flood attacks, targeting the Windows machine from the Kali Linux machine [1].

    Capture Process: Wireshark was used on the Kali Linux machine to capture all incoming and outgoing network traffic [2]. The capture was set up to record detailed packet information, including timestamps, source and destination IP addresses, ports, and protocols. The captures were conducted with careful monitoring to precisely mark the start and end times of the flood attack for accurate dataset labeling.

    Dataset Description

    The dataset is a CSV file containing a comprehensive collection of network traffic packets labeled to distinguish between normal and malicious traffic. It includes the following columns:

    • Timestamp: The capture time of each packet, providing insights into the traffic flow and enabling analysis of traffic patterns over time.
    • Source IP Address: Identifies the origin of the packet, crucial for pinpointing potential sources of attacks.
    • Destination IP Address: Indicates the packet's intended recipient, useful for identifying targeted resources.
    • Source Port and Destination Port: Offer insights into the services involved in the communication.
    • Protocol: Specifies the protocol used, such as TCP, UDP, or ICMP, essential for analyzing the nature of the traffic.
    • Length: The size of the packet in bytes, which can signal unusual traffic patterns often associated with malicious activities.
    • bad_packet: A binary label, with 1 indicating traffic identified as part of a flood attack and 0 denoting normal traffic. Precise timestamps marking the start and end of flood attacks were used to accurately label this column: packets captured within these defined intervals were marked as malicious (bad_packet = 1), whereas all others were considered normal traffic. Python and Pandas were used for the labeling process [3][4].
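
    A sketch of the interval-based labeling step described above (the filename and the attack window bounds are placeholders):

    import pandas as pd

    df = pd.read_csv("network_traffic.csv", parse_dates=["Timestamp"])  # placeholder filename
    attack_start = pd.Timestamp("2024-03-01 12:00:00")                  # placeholder window bounds
    attack_end = pd.Timestamp("2024-03-01 12:05:00")
    df["bad_packet"] = df["Timestamp"].between(attack_start, attack_end).astype(int)
    print(df["bad_packet"].value_counts())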

    Potential Applications

    a. Intrusion Detection Systems (IDS): The dataset can be used in training models to enhance IDS capabilities, enabling more effective detection of flood-based network attacks.
    b. Network Traffic Monitoring: Tools making use of machine learning can leverage the dataset for more accurate network traffic monitoring, identifying and alerting on suspicious activities in real time.
    c. Cybersecurity Training: Educational institutions and training programs can use the dataset to provide practical experience in machine learning-based threat detection.

    Proposed Machine Learning Technique: Supervised Machine Learning, specifically Deep Learning with Convolutional Neural Networks (CNNs).

    CNNs, even though they are usually used for image processing, have shown promise in analyzing sequential data. The spatial hierarchy in network packets (from individual bytes to overall packet structure) can be analogous to the patterns CNNs excel at identifying. Utilizing CNNs could allow for the extraction of complex patterns in network traffic that indicate malicious activities, improving detection accuracy beyond traditional methods.

    Conclusion

    This dataset represents a significant step towards using machine learning for cybersecurity, specifically in the field of intrusion detection and network monitoring. By providing a detailed and accurately labeled dataset of normal and malicious network traffic, it lays the groundwork for developing complex models capable of identifying and mitigating flood attacks in real-time. In the future, we could include a broader range of attack types and more traffic patterns, further enhancing the dataset's utility and the effectiveness of models trained on it.

    References [1] https://linux.die.net/man/8/hping3 [2] https://www.wireshark.org/docs/ [3] https://pandas.pydata.org/docs/ [4] https://docs.python.org/3/tutorial/index.html

  16. LUMPI: The Leibniz University Multi-Perspective Intersection Dataset

    • service.tib.eu
    • data.uni-hannover.de
    Updated May 16, 2025
    Cite
    (2025). LUMPI: The Leibniz University Multi-Perspective Intersection Dataset [Dataset]. https://service.tib.eu/ldmservice/dataset/luh-lumpi
    Explore at:
    Dataset updated
    May 16, 2025
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    Increasing improvements in sensor technologies as well as machine learning methods allow an efficient collection, processing and analysis of the dynamic environment, which can be used for the detection and tracking of traffic participants. Current datasets in this domain mostly present a single view, preventing highly accurate pose estimation due to occlusions. The integration of different, simultaneously acquired data makes it possible to exploit and develop collaboration principles to increase the quality, reliability and integrity of the derived information. This work addresses this problem by providing a multi-view dataset, including 2D image information (videos) and 3D point clouds with labels of the traffic participants in the scene. The dataset was recorded during different weather and light conditions on several days at a large junction in Hanover, Germany.

    Dataset teaser video: https://youtu.be/elwFdCu5IFo
    Dataset download path: https://data.uni-hannover.de/vault/ikg/busch/LUMPI/
    Labeling process pipeline video: https://youtu.be/Ns6qsHsb06E
    Python SDK: https://github.com/St3ff3nBusch/LUMPI-SDK-Python
    Labeling Tool / C++ SDK: https://github.com/St3ff3nBusch/LUMPI-Labeling

  17. English Wikipedia labeled mid-level wikiprojects set

    • figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Sumit Asthana; Aaron Halfaker (2023). English Wikipedia labeled mid-level wikiprojects set [Dataset]. http://doi.org/10.6084/m9.figshare.5640526.v1
    Explore at:
    txt
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    figshare
    Authors
    Sumit Asthana; Aaron Halfaker
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains a set of 93,449 observations providing wikiproject mid-level category labels associated with talk pages for the respective Wikipedia articles. Each observation includes a talk page title, talk page id, the latest revision id when the extraction was done, the associated wikiproject templates, and the mid-level wikiproject categories the corresponding article page belongs to. The dataset was generated using a python script that ran mysql queries on Wikimedia PAWS. To ensure a balanced set, the script extracts a random set of 2000 page-ids per mid-level category, totaling about 93,449 observations. This dataset opens up immense possibilities for topic-oriented research around Wikipedia, as it exposes high-level topic data associated with Wikipedia pages.

  18. Retail Product Checkout Dataset

    • universe.roboflow.com
    zip
    Updated Dec 17, 2022
    Cite
    Samrat Sahoo (2022). Retail Product Checkout Dataset [Dataset]. https://universe.roboflow.com/samrat-sahoo/groceries-6pfog/model/6
    Explore at:
    zip
    Dataset updated
    Dec 17, 2022
    Dataset authored and provided by
    Samrat Sahoo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Groceries Items Bounding Boxes
    Description

    Overview

    Via https://rpc-dataset.github.io: This dataset enjoys the following characteristics: (1) It is by far the largest dataset in terms of both product image quantity and product categories. (2) It includes single-product images taken in a controlled environment and multi-product images taken by the checkout system. (3) It provides different levels of annotations for the checkout images. Compared with existing datasets, ours is closer to the realistic setting and can derive a variety of research problems.

    Use Cases

    This dataset could be used to create an automatic item counter or checkout system using computer vision with Roboflow's API, Python Package, or other deployment options, such as a Web Browser, an iOS device, or an Edge Device: https://docs.roboflow.com/inference/hosted-api.

    Using this Dataset

    This dataset is licensed under a CC BY 4.0 license. You can copy, redistribute, and modify the images as long as appropriate credit is given to the authors of the dataset.

    About Roboflow

    Roboflow creates tools that make computer vision easy to use for any developer, even if you're not a machine learning expert. You can use it to organize, label, inspect, convert, and export your image datasets, and even to train and deploy computer vision models with no code required. https://roboflow.com

  19. Overall adulterated honey data set from each brand and botanical origins label of honey

    • plos.figshare.com
    xls
    Updated Jun 10, 2024
    Cite
    Esmael Ahmed (2024). Overall adulterated honey data set from each brand and botanical origins label of honey. [Dataset]. http://doi.org/10.1371/journal.pdig.0000536.t002
    Explore at:
    xls
    Dataset updated
    Jun 10, 2024
    Dataset provided by
    PLOS Digital Health
    Authors
    Esmael Ahmed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overall adulterated honey data set from each brand and botanical origins label of honey.

  20. MSD-I: Million Song Dataset with Images for Multimodal Genre Classification

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, tsv, txt
    Updated Jan 24, 2020
    Cite
    Sergio Oramas (2020). MSD-I: Million Song Dataset with Images for Multimodal Genre Classification [Dataset]. http://doi.org/10.5281/zenodo.1240485
    Explore at:
    tsv, application/gzip, txt
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sergio Oramas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Million Song Dataset (https://labrosa.ee.columbia.edu/millionsong/) is a collection of metadata and precomputed audio features for 1 million songs. Along with this dataset, a dataset with annotations of 15 top-level genres with a single label per song was released. In our work, we combine the CD2c version of this genre dataset (http://www.tagtraum.com/msd_genre_datasets.html) with a collection of album cover images.


    The final dataset contains 30,713 tracks from the MSD and their related album cover images, each annotated with a unique genre label among 15 classes. Based on an initial analysis of the images, we identified that this set of tracks is associated with 16,753 albums, yielding an average of 1.8 songs per album.

    We randomly divide the dataset into three parts: 70% for training, 15% for validation, and 15% for test, with no artist and album overlap across these sets. This is crucial to avoid possible overfitting, as the classifier may learn to predict the artist instead of the genre.
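
    One way to reproduce such an artist-disjoint split is with scikit-learn's GroupShuffleSplit. A sketch, where the mapping filename and the artist column name are assumptions:

    import pandas as pd
    from sklearn.model_selection import GroupShuffleSplit

    tracks = pd.read_csv("msdi_mapping.tsv", sep="\t")   # hypothetical mapping file
    # Carve off 70% for training, grouping by artist so no artist spans subsets.
    gss = GroupShuffleSplit(n_splits=1, train_size=0.7, random_state=0)
    train_idx, hold_idx = next(gss.split(tracks, groups=tracks["artist_id"]))  # hypothetical column
    # Split the held-out 30% in half, again by artist, for validation and test.
    hold = tracks.iloc[hold_idx]
    gss2 = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=0)
    val_rel, test_rel = next(gss2.split(hold, groups=hold["artist_id"]))
    val_idx, test_idx = hold_idx[val_rel], hold_idx[test_rel]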

    Content:

    MSD-I dataset (mapping, metadata, annotations and links to images)
    Data splits and feature vectors for TISMIR single-label classification experiments

    These data can be used together with the Tartarus deep learning python module https://github.com/sergiooramas/tartarus.

    Scientific References:

    Please cite the following paper if using the MSD-I dataset or the Tartarus software.

    Oramas, S., Barbieri, F., Nieto, O., and Serra, X. (2018). Multimodal Deep Learning for Music Genre Classification, Transactions of the International Society for Music Information Retrieval, V(1).
