License: CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, this study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated: Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis covers data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metrics, eXtreme Gradient Boosting was the best-performing algorithm in both classification and regression, with an R² score of 97.45% and an accuracy of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen.

Methods

Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish datasets, study them, and construct models in a web-based data-science environment.
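As a rough illustration of the regression setup described above (a sketch, not the study's actual pipeline), one can fit an XGBoost regressor to the standard Kaggle diamonds CSV and report an R² score; the file name, encoding choices, and hyperparameters here are assumptions.

```python
# Sketch: XGBoost regression on the Kaggle diamonds dataset.
# Assumes a local diamonds.csv with the usual columns (carat, cut, color,
# clarity, depth, table, price, x, y, z); hyperparameters are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

df = pd.read_csv("diamonds.csv")
for col in ["cut", "color", "clarity"]:
    df[col] = df[col].astype("category").cat.codes  # ordinal integer encoding

X, y = df.drop(columns=["price"]), df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
model.fit(X_train, y_train)
print("R2 on held-out data:", r2_score(y_test, model.predict(X_test)))
```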
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Advances in neuroimaging, genomics, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples are of critical importance for identifying biomarkers and conducting feasibility and pilot work, but they can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1,000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contributions to bias of data dimensionality, hyper-parameter space and number of CV folds were explored, and validation methods were compared on discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies depending on which validation method was used.
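To make the validation argument concrete, here is a minimal scikit-learn sketch (not the paper's code) of nested CV with feature selection kept inside the pipeline, so neither selection nor tuning ever sees the outer test folds; on pure-noise data the resulting estimate should hover near chance (0.5).

```python
# Nested CV on pure-noise, high-dimensional, small-sample data.
# Feature selection lives inside the pipeline, so it is re-fit on each
# training fold rather than on pooled train+test data (the leakage above).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1000))   # 40 samples, 1000 features
y = rng.integers(0, 2, size=40)   # random labels: no real signal

pipe = Pipeline([("select", SelectKBest(f_classif, k=10)), ("clf", SVC())])
inner = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=3)  # tuning loop
scores = cross_val_score(inner, X, y, cv=5)                 # evaluation loop
print(f"Nested CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```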
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Despite the ongoing success of populist parties in many parts of the world, we lack comprehensive information about parties' level of populism over time. A recent contribution to Political Analysis by Di Cocco and Monechi (DCM) suggests that this research gap can be closed by predicting parties' populism scores from their election manifestos using supervised machine-learning. In this paper, we provide a detailed discussion of the suggested approach. Building on recent debates about the validation of machine-learning models, we argue that the validity checks provided in DCM's paper are insufficient. We conduct a series of additional validity checks and empirically demonstrate that the approach is not suitable for deriving populism scores from texts. We conclude that measuring populism over time and between countries remains an immense challenge for empirical research. More generally, our paper illustrates the importance of more comprehensive validations of supervised machine-learning models.
One of the primary challenges inherent in utilizing deep learning models is the scarcity and accessibility hurdles associated with acquiring datasets of sufficient size to facilitate effective training of these networks. This is particularly significant in object detection, shape completion, and fracture assembly. Instead of scanning a large number of real-world fragments, it is possible to generate massive datasets with synthetic pieces. However, realistic fragmentation is computationally intensive in both preparation (e.g., pre-fractured models) and generation. Simpler algorithms such as Voronoi diagrams provide faster processing at the expense of realism. Hence, computational efficiency and realism must be balanced when generating large datasets for machine learning.
We proposed a GPU-based fragmentation method to improve the baseline Discrete Voronoi Chain aimed at completing this dataset generation task. The dataset in this repository includes voxelized fragments from high-resolution 3D models, curated to be used as training sets for machine learning models. More specifically, these models come from an archaeological dataset, which led to more than 1M fragments from 1,052 Iberian vessels. In this dataset, fragments are not stored individually; instead, the fragmented voxelizations are provided in a compressed binary file (.rle.zip). Once uncompressed, each fragment is represented by a different number in the grid. The class to which each vessel belongs is also included in class.csv. The GPU-based pipeline that generated this dataset is explained at https://doi.org/10.1016/j.cag.2024.104104.
Please note that this dataset originally provided voxel data, point clouds, and triangle meshes. However, we opted to include only voxel data because (1) the original dataset is too large to be uploaded to Zenodo and (2) the original intent of our paper is to generate implicit data in the form of voxels. If interested in the whole dataset (450GB), please visit the web page of our research institute.
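For illustration only, a decoder for the compressed fragment grids might look like the sketch below. The actual binary layout of the .rle.zip files is not specified here, so the (value, run-length) uint32 pair format, the grid shape, and the file name are all assumptions; consult the dataset documentation before use.

```python
# Hypothetical decoder for a fragmented voxel grid stored as .rle.zip.
# ASSUMPTION: the stream is a flat sequence of (value, run_length) uint32
# pairs; the real format may differ.
import zipfile
import numpy as np

def load_fragment_grid(path, shape, dtype=np.uint32):
    with zipfile.ZipFile(path) as zf:
        raw = zf.read(zf.namelist()[0])
    pairs = np.frombuffer(raw, dtype=dtype).reshape(-1, 2)
    grid = np.repeat(pairs[:, 0], pairs[:, 1]).reshape(shape)
    return grid  # each fragment is a distinct integer label in the grid

grid = load_fragment_grid("vessel_0001.rle.zip", shape=(256, 256, 256))
print("Fragment labels:", np.unique(grid)[1:])  # 0 assumed to be empty space
```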
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
The dataset has 21 columns carrying the features (questions) of 988 respondents. The efficiency of any machine learning model depends heavily on its raw initial dataset, so we had to be extra careful in gathering our information. For our particular problem, we needed data that was not only authentic but also versatile enough to capture the proper information from relevant sources. Hence we opted to build our dataset by dispatching a survey questionnaire among targeted audiences. First, we built the questionnaire with inquiries formulated after keen observation. Studying the behavior of our intended audience, we came up with factual and informative queries that generated appropriate data. Our prime audience was people who frequently buy fashion accessories, so we created a set of questionnaires emphasizing questions related to that field. We had a total of twenty-one well-revised questions that gave us an overview of all the answers our system would need. In this way, we gathered 988 authentic responses and concluded our initial raw dataset accordingly.
Deep Learning (DL) has consistently surpassed other Machine Learning methods and achieved state-of-the-art performance in multiple cases. Several modern applications like financial and recommender systems require models that are constantly updated with fresh data. The prominent approach for keeping a DL model fresh is to trigger full retraining from scratch when enough new data are available. However, retraining large and complex DL models is time-consuming and compute-intensive. This makes full retraining costly, wasteful, and slow. In this paper, we present an approach to continuously train and deploy DL models. First, we enable continuous training through proactive training that combines samples of historical data with new streaming data. Second, we enable continuous deployment through gradient sparsification that allows us to send a small percentage of the model updates per training iteration. Our experimental results with LeNet5 on MNIST and modern DL models on CIFAR-10 show that proactive training keeps models fresh with comparable—if not superior—performance to full retraining at a fraction of the time. Combined with gradient sparsification, sparse proactive training enables very fast updates of a deployed model with arbitrarily large sparsity, reducing communication per iteration up to four orders of magnitude, with minimal—if any—losses in model quality. Sparse training, however, comes at a price; it incurs overhead on the training that depends on the size of the model and increases the training time by factors ranging from 1.25 to 3 in our experiments. Arguably, this is a small price to pay for successfully enabling the continuous training and deployment of large DL models.
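The gradient-sparsification idea can be sketched in a few lines of PyTorch: after the backward pass, keep only the top-k entries of each gradient by magnitude, so that only a small fraction of the update needs to be communicated. This is a simplified sketch; the paper's exact scheme (e.g., any error accumulation for the dropped entries) may differ.

```python
import torch

@torch.no_grad()
def sparsify_gradients(model, keep_ratio=0.01):
    """Zero all but the largest keep_ratio fraction of each gradient."""
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.view(-1)
        k = max(1, int(keep_ratio * g.numel()))
        threshold = g.abs().topk(k).values.min()
        g[g.abs() < threshold] = 0.0  # only kept entries need be transmitted

# Usage per iteration: loss.backward(); sparsify_gradients(model); optimizer.step()
```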
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
The dataset used in this study is publicly available for research purposes. If you are using this dataset, please cite the following paper, which outlines the complete details of the dataset and the methodology used for its generation:
Amit Karamchandani, Javier Núñez, Luis de-la-Cal, Yenny Moreno, Alberto Mozo, Antonio Pastor, "On the Applicability of Network Digital Twins in Generating Synthetic Data for Heavy Hitter Discrimination," under submission.
This is a synthetic dataset generated to differentiate between benign and malicious heavy hitter (HH) flows within complex network environments. Heavy hitter flows, which include high-volume data transfers, can significantly impact network performance, leading to congestion and degraded quality of service. Distinguishing legitimate heavy hitter activity from malicious Distributed Denial-of-Service traffic is critical for network management and security, yet existing datasets lack the granularity needed to train machine learning models to make this distinction effectively.
To address this, a Network Digital Twin (NDT) approach was utilized to emulate realistic network conditions and traffic patterns, enabling automated generation of labeled data for both benign and malicious HH flows alongside regular traffic.
The feature set includes flow statistics commonly used in network analysis.
License: etalab 2.0 (https://spdx.org/licenses/etalab-2.0.html)
Determining the association constant between a cyclodextrin and a guest molecule is an important task for applications in various industrial and academic fields. However, such a task is time-consuming, tedious, and requires samples of both molecules. A significant number of association constants and relevant data are available from the literature, which makes it possible to use machine learning techniques to predict association constants. However, such data are mainly available as tables in articles or appendices; they must be made available in a computer-friendly format and curated. Furthermore, the raw data need to be enriched with physicochemical information about each molecule, and when such information does not suffice to discriminate between molecules, some additional data is needed. We present a dataset built from data gathered from the literature.
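As one hedged example of such enrichment (the dataset authors' actual tooling and descriptor set are not specified here), RDKit can compute basic physicochemical descriptors from a SMILES string:

```python
# Enrich a guest molecule with simple physicochemical descriptors.
# The descriptor choice is illustrative, not the curated dataset's exact set.
from rdkit import Chem
from rdkit.Chem import Descriptors

def physchem_descriptors(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return {
        "mol_weight": Descriptors.MolWt(mol),
        "logp": Descriptors.MolLogP(mol),
        "h_bond_donors": Descriptors.NumHDonors(mol),
        "h_bond_acceptors": Descriptors.NumHAcceptors(mol),
    }

print(physchem_descriptors("c1ccccc1O"))  # phenol, a typical small guest
```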
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Correlation of class sample size in the training set with classification performance.
We assess which waters the Clean Water Act protects and how Supreme Court and White House rules change this regulation. We train a deep learning model using aerial imagery and geophysical data to predict 150,000 jurisdictional determinations from the Army Corps of Engineers, each deciding regulation for one water resource. Under a 2006 Supreme Court ruling, the Clean Water Act protects two-thirds of US streams and over half of wetlands; under a 2020 White House rule, it protects under half of streams and a fourth of wetlands, implying deregulation of 690,000 stream miles, 35 million wetland acres, and 30% of waters around drinking water sources. Our framework can support permitting, policy design, and use of machine learning in regulatory implementation problems.

Training data from: Machine learning predicts which rivers, streams, and wetlands the Clean Water Act regulates
This dataset contains data used to train the models in Greenhill et al. (2023). All data are publicly available and can be accessed either through Google Earth Engine or directly from the data providers, as described in Table S3 of the Supplementary Material. In addition, we are providing access to the full set of pre-processed inputs for model training via this repository. We are also providing access to a subset of the data used for prediction, as well as all data needed for reproducing the results of the paper, in another Dryad repository: . All code written for the project is available at .
The files here include:
wotus_model.pth.tar
resource_type_model.pth.tar
cowardin_code_model.pth.tar
ajd_model.pth.tar
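The checkpoints can presumably be opened with PyTorch; the sketch below only inspects one. The key layout (e.g., a 'state_dict' entry) follows common PyTorch conventions and is an assumption, not documentation of these specific files.

```python
# Peek inside one of the provided model checkpoints.
import torch

ckpt = torch.load("wotus_model.pth.tar", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # often 'state_dict', 'epoch', 'optimizer', ...
```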
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
A Flexible Machine Learning-Aware Architecture for Future WLANs
Authors: Francesc Wilhelmi, Sergio Barrachina-Muñoz, Boris Bellalta, Cristina Cano, Anders Jonsson & Vishnu Ram.
Abstract: Lots of hopes have been placed in Machine Learning (ML) as a key enabler of future wireless networks. By taking advantage of the large volumes of data generated by networks, ML is expected to deal with the ever-increasing complexity of networking problems. Unfortunately, current networking systems are not yet prepared to support the ensuing requirements of ML-based applications, especially for enabling procedures related to data collection, processing, and output distribution. This article points out the architectural requirements needed to pervasively include ML as part of future wireless network operation. To this aim, we propose to adopt the International Telecommunications Union (ITU) unified architecture for 5G and beyond. Specifically, we look into Wireless Local Area Networks (WLANs), which, due to their nature, can be found in multiple forms, ranging from cloud-based to edge-computing-like deployments. Based on ITU's architecture, we provide insights on the main requirements and the major challenges of introducing ML to the multiple modalities of WLANs.
Dataset description: This is the dataset generated for training a Neural Network (NN) in the Access Point (AP) (re)association problem in IEEE 802.11 Wireless Local Area Networks (WLANs).
In particular, the NN is meant to output a prediction function of the throughput that a given station (STA) can obtain from a given Access Point (AP) after association. The features included in the dataset are:
Identifier of the AP to which the STA has been associated.
RSSI obtained from the AP to which the STA has been associated.
Data rate in bits per second (bps) that the STA is allowed to use for the selected AP.
Load in packets per second (pkt/s) that the STA generates.
Percentage of data that the AP is able to serve before the user association is done.
Amount of traffic load in pkt/s handled by the AP before the user association is done.
Airtime in % that the AP enjoys before the user association is done.
Throughput in pkt/s that the STA receives after the user association is done.
The dataset has been generated through random simulations, based on the model provided in https://github.com/toniadame/WiFi_AP_Selection_Framework. More details regarding the dataset generation have been provided in https://github.com/fwilhelmi/machine_learning_aware_architecture_wlans.
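A minimal sketch of the learning task, assuming the CSV layout follows the eight features above (the file name and column names here are placeholders to be mapped to the published headers):

```python
# Train a small neural network to predict post-association throughput
# from the seven input features; "throughput" is the target column.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("wlan_dataset.csv")  # hypothetical file name
X, y = df.drop(columns=["throughput"]), df["throughput"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0),
)
model.fit(X_train, y_train)
print("Test R2:", model.score(X_test, y_test))
```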
Community science image libraries offer a massive, but largely untapped, source of observational data for phenological research. The iNaturalist platform offers a particularly rich archive, containing more than 49 million verifiable, georeferenced, open access images, encompassing seven continents and over 278,000 species. A critical limitation preventing scientists from taking full advantage of this rich data source is labor. Each image must be manually inspected and categorized by phenophase, which is both time-intensive and costly. Consequently, researchers may only be able to use a subset of the total number of images available in the database. While iNaturalist has the potential to yield enough data for high-resolution and spatially extensive studies, it requires more efficient tools for phenological data extraction. A promising solution is automation of the image annotation process using deep learning. Recent innovations in deep learning have made these open-source tools accessible...
Embeddings and raw files to complement the paper "General Chemically Intuitive Atom-Level DFT Descriptors for Machine Learning Approaches to Reaction Condition Prediction". The embeddings should be all the data needed for full reproducibility of the published results. The GitHub repo GeneralDFT (https://github.com/moleculebits/GeneralDFT) contains the Python scripts required to make use of the data, along with some basic plotting functionality.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This repository introduces RadIOCD (Radar-based Interior Object Classification Dataset), which contains sparse point cloud representations of interior objects, collected by subjects wearing a commercial off-the-shelf mmWave radar. RadIOCD includes recordings of 10 volunteers, aged between 25 and 50 years old. A total of 5 objects were recorded, with the participants moving towards them in 2 different environments. RadIOCD includes sparse 3D point cloud data, together with doppler velocity provided by the mmWave radar. The files are stored in CSV format to ensure reuse.
The scope of RadIOCD is to make data available for the recognition of objects recorded solely by the mmWave radar, to be used in applications where vision-based classification is not robust (e.g., in search and rescue operations where there is smoke inside a building). Furthermore, we showcase that this dataset contains enough data to apply machine learning techniques, and that models trained on it can generalize to different environments and "unseen" subjects.
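As a hedged starting point (the exact CSV columns and labeling scheme are assumptions, not dataset documentation), one could summarize each recording into simple statistics and fit a baseline classifier:

```python
# Baseline object classification from per-recording point cloud summaries.
# ASSUMPTIONS: columns x, y, z, doppler; object label encoded in file name.
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def summarize(csv_path):
    pc = pd.read_csv(csv_path)
    return np.concatenate([pc.mean().to_numpy(), pc.std().to_numpy()])

files = sorted(Path("radiocd").glob("*.csv"))
X = np.stack([summarize(f) for f in files])
y = [f.stem.split("_")[0] for f in files]  # hypothetical label convention

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# For a fair test of the generalization claim, evaluate on held-out
# subjects/environments rather than on the training recordings.
print("Training accuracy:", clf.score(X, y))
```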
License: custom license (https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.11588/DATA/D3WZID)
Automatic damage assessment by analysing UAV-derived 3D point clouds provides fast information on the damage situation after an earthquake. However, the assessment of different damage grades is challenging given the variety in damage characteristics and limited transferability of methods to other geographic regions or data sources. We present a novel change-based approach to automatically assess multi-class building damage from real-world point clouds using a machine learning model trained on virtual laser scanning (VLS) data. Therein, we (1) identify object-specific point cloud-based change features, (2) extract changed building parts using k-means clustering, (3) train a random forest machine learning model with VLS data based on object-specific change features, and (4) use the classifier to assess building damage in real-world photogrammetric point clouds. We evaluate the classifier with respect to its capacity to classify three damage grades (heavy, extreme, destruction) in pre-event and post-event point clouds of an earthquake in L'Aquila (Italy). Using object-specific change features derived from bi-temporal point clouds, our approach is transferable with respect to multi-source input point clouds used for model training (VLS) and application (real-world photogrammetry). We further achieve geographic transferability by using simulated training data which characterises damage grades across different geographic regions. The model yields high multi-target classification accuracies (overall accuracy: 92.0%–95.1%). Classification performance improves only slightly when using real-world region-specific training data (3% higher overall accuracies). We consider our approach especially relevant for applications where timely information on the damage situation is required and sufficient real-world training data is not available.

This dataset includes: 3D building models (building_models.zip) representing the target damage grades (no damage, heavy damage, extreme damage, destruction) of this study, and Python source code (code.zip) used in this study to (1) generate simulated multi-temporal 3D point clouds using HELIOS++ (https://github.com/3dgeo-heidelberg/helios), (2) extract damaged building parts using k-means clustering, (3) compute object-specific geometric change features per building, and (4) train a multi-target random forest classifier to classify buildings into four damage grades based on object-specific change features.
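An illustrative sketch of step (4), distinct from the released code.zip: train the random forest on per-building change features. The file name and feature columns are placeholders.

```python
# Classify buildings into four damage grades from object-specific
# change features, using simulated (VLS) training data.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("vls_change_features.csv")  # hypothetical export
X = train.drop(columns=["damage_grade"])
y = train["damage_grade"]  # no damage / heavy / extreme / destruction

clf = RandomForestClassifier(n_estimators=500, random_state=0)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```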
Dataset Card for ArPod
Dataset Summary
[More Information Needed]
Supported Tasks and Leaderboards
[More Information Needed]
Languages
[More Information Needed]
Dataset Structure
Data Instances
[More Information Needed]
Data Fields
[More Information Needed]
Data Splits
[More Information Needed]
Dataset Creation
Curation Rationale
[More Information Needed]
Source Data… See the full description on the dataset page: https://huggingface.co/datasets/arbml/ArPod.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Abstract

The dataset was collected for 332 compute nodes throughout May 19 - 23, 2023. May 19 - 22 characterizes normal compute cluster behavior, while May 23 includes an anomalous event. The dataset includes eight CPU, 11 disk, 47 memory, and 22 Slurm metrics. It represents five distinct hardware configurations and contains over one million records, totaling more than 180GB of raw data.

Background

Motivated by the goal of developing a digital twin of a compute cluster, the dataset was collected using a Prometheus server (1) scraping the Thomas Jefferson National Accelerator Facility (JLab) batch cluster, which runs an assortment of physics analysis and simulation jobs; analysis workloads leverage data generated from the laboratory's electron accelerator, and simulation workloads generate large amounts of flat data that is then carved to verify amplitudes. Metrics were scraped from the cluster throughout May 19 - 23, 2023. Data from May 19 to May 22 primarily reflected normal system behavior, while May 23, 2023, recorded a notable anomaly. This anomaly was severe enough to necessitate intervention by JLab IT Operations staff. The metrics were collected from CPU, disk, memory, and Slurm. Metrics related to CPU, disk, and memory provide insights into the status of individual compute nodes. Furthermore, Slurm metrics collected from the network can detect anomalies that may propagate to compute nodes executing the same job.

Usage Notes

While the data from May 19 - 22 characterizes normal compute cluster behavior and May 23 includes anomalous observations, the dataset cannot be considered labeled data: the set of affected nodes and the exact start and end times of the abnormal effects are unclear. Thus, the dataset could be used to develop unsupervised machine-learning algorithms to detect anomalous events in a batch cluster. https://doi.org/10.48550/arXiv.2311.16129
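For example, a simple unsupervised baseline over the node metrics could look like the following sketch (the CSV layout, one row per node and timestamp with numeric metric columns, is an assumption about how one might arrange the scraped data):

```python
# Unsupervised anomaly scoring of compute-node metrics.
import pandas as pd
from sklearn.ensemble import IsolationForest

metrics = pd.read_csv("node_metrics.csv", parse_dates=["timestamp"])
features = metrics.select_dtypes("number")

iso = IsolationForest(contamination=0.01, random_state=0).fit(features)
metrics["anomaly_score"] = iso.decision_function(features)  # lower = more anomalous
print(metrics.nsmallest(10, "anomaly_score")[["timestamp", "anomaly_score"]])
```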
NOTE: The manuscript associated with this data package is currently in review. The data may be revised based on reviewer feedback. Upon manuscript acceptance, this data package will be updated with the final dataset and additional metadata.

This data package is associated with the manuscript "Artificial intelligence-guided iterations between observations and modeling significantly improve environmental predictions" (Malhotra et al., in prep). This effort was designed following ICON (integrated, coordinated, open, and networked) principles to facilitate a model-experiment (ModEx) iteration approach, leveraging crowdsourced sampling across the contiguous United States (CONUS). New machine learning models were created every month to guide sampling locations. Data from the resulting samples were used to test and rebuild the machine learning models for the next round of sampling guidance. Associated sediment and water geochemistry and in situ sensor data can be found at https://data.ess-dive.lbl.gov/datasets/doi:10.15485/1923689, https://data.ess-dive.lbl.gov/datasets/doi:10.15485/1729719, and https://data.ess-dive.lbl.gov/datasets/doi:10.15485/1603775. This data package is associated with two GitHub repositories found at https://github.com/parallelworks/dynamic-learning-rivers and https://github.com/WHONDRS-Hub/ICON-ModEx_Open_Manuscript. In addition to this readme, this data package also includes two file-level metadata (FLMD) files that describe each file and two data dictionaries (DD) that describe all column/row headers and variable definitions.

This data package consists of two main folders, (1) dynamic-learning-rivers and (2) ICON-ModEx_Open_Manuscript, which contain snapshots of the associated GitHub repositories. The input data, output data, and machine learning models used to guide sampling locations are within dynamic-learning-rivers. The folder is organized into five top-level directories: (1) “input_data” holds the training data for the ML models; (2) “ml_models” holds machine learning (ML) models trained on the data in “input_data”; (3) “examples” contains files for direct experimentation with the machine learning model, including scripts for setting up a “hindcast” run; (4) “scripts” contains data preprocessing and postprocessing scripts and intermediate results specific to this data set that bookend the ML workflow; and (5) “output_data” holds the overall results of the ML model on that branch. Each trained ML model resides on its own branch in the repository; this means that inputs and outputs can differ branch-to-branch. There is also one hidden directory, “.github/workflows”, which contains information for running the ML workflow as an end-to-end automated GitHub Action, but it is not needed for reusing the ML models archived here. Please see the top-level README.md in the GitHub repository for more details on the automation.

The scripts and data used to create figures in the manuscript are within ICON-ModEx_Open_Manuscript. The folder is organized into four folders which contain the scripts, data, and PDF for each figure. Within the “fig-model-score-evolution” folder, there is a folder called “intermediate_branch_data” which contains some intermediate files pulled from dynamic-learning-rivers and reorganized to integrate easily into the workflows. NOTE: THIS FOLDER INCLUDES THE FILES AT THE POINT OF PAPER SUBMISSION. IT WILL BE UPDATED ONCE THE PAPER IS ACCEPTED WITH ANY REVISIONS AND WILL INCLUDE A DD/FLMD AT THAT POINT.
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
The data set represents movies released from xxx up to 2017. It is kept quite general and does not have any real problem / challenge as a background. The whole data set is meant for practicing different types of techniques as a data analyst / data scientist.
I'd also like to mention that the dataset is not fully cleaned. The reasoning is that it should demonstrate the real life of being an analyst / scientist: Get Data - Prep Data - Analyse Data - Visualize Data - Predict Outcomes of different Use Cases ;-)
I love watching movies and therefore tried to combine this hobby with my current self-studies of becoming a data scientist. I needed a way to obtain a data set that included information on movies so that I could play around and use my learnings. At first glance I could see that the data set can be used for regression, classification, or potentially even deep learning (such as image recognition - poster URLs are given).
I acquired this dataset in several steps. First I checked the internet for a specific API which I could use to receive movie information. After a short time I got to know omdbapi.com. With the help of this API I was able to fetch information based on the title of a movie.
Then I had another problem: I was missing movie titles. The next search began. I couldn't find an API for that, but I did see that Wikipedia was quite well structured in regards to movie titles. So I built a scraper to fetch all movie titles from 1990 to 2017.
After receiving all the data I could finally obtain the full information for a movie given its title + year (there might be movies which have the same name). Unfortunately some movie titles were written differently, so I had a failure rate of 10% when obtaining the movie data. Based on the 10% failed movie titles, I did a text analysis and found around 400,000 new movies / series. The latest version should include nearly 200,000 different movies based on the imdbID.
Additionally, I cleaned some of the information such as Genre, Actors and Writer for better analysis. Each of the CSV files can be joined by the imdbID. Be aware that some information is missing and declared as _NOT_GIVEN.
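For instance (file names here are illustrative), the per-topic CSVs can be joined on imdbID, treating the _NOT_GIVEN placeholder as missing:

```python
# Join two of the dataset's CSV files on imdbID and inspect missingness.
import pandas as pd

movies = pd.read_csv("movies.csv", na_values=["_NOT_GIVEN"])
genres = pd.read_csv("genres.csv", na_values=["_NOT_GIVEN"])
merged = movies.merge(genres, on="imdbID", how="left")
print(merged.isna().mean().sort_values(ascending=False).head())
```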
The inspiration for this data set came from getting into the practical flow of developing an image recognition application: recognizing the genre of a movie from its poster. On request I could also provide the movie poster images. But for the given dataset, I do have a few guiding questions in mind.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Dataset created for machine learning and deep learning training and teaching purposes.
Can for instance be used for classification, regression, and forecasting tasks.
Complex enough to demonstrate realistic issues such as overfitting and unbalanced data, while still remaining intuitively accessible.
ORIGINAL DATA TAKEN FROM:
EUROPEAN CLIMATE ASSESSMENT & DATASET (ECA&D), file created on 22-04-2021
THESE DATA CAN BE USED FREELY PROVIDED THAT THE FOLLOWING SOURCE IS ACKNOWLEDGED:
Klein Tank, A.M.G. and Coauthors, 2002. Daily dataset of 20th-century surface
air temperature and precipitation series for the European Climate Assessment.
Int. J. of Climatol., 22, 1441-1453.
Data and metadata available at http://www.ecad.eu
For more information see metadata.txt file.
The Python code used to create the weather prediction dataset from the ECA&D data can be found on GitHub: https://github.com/florian-huber/weather_prediction_dataset
(this repository also contains Jupyter notebooks with teaching examples)