Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This data set is uploaded as supporting information for the publication entitled: "Effect of data preprocessing and machine learning hyperparameters on mass spectrometry imaging models".

Files are as follows:
polymer_microarray_data.mat - MATLAB workspace file containing peak-picked ToF-SIMS data (hyperspectral array) for the polymer microarray sample.
nylon_data.mat - MATLAB workspace file containing m/z binned ToF-SIMS data (hyperspectral array) for the semi-synthetic nylon data set, generated from 7 nylon samples.

Additional details about the datasets can be found in the published article. If you use this data set in your work, please cite our work as follows:
Cite as: Gardner et al., J. Vac. Sci. Technol. A 41, 000000 (2023); doi: 10.1116/6.0002788
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In this study, we introduce the count-based Morgan fingerprint (C-MF) to represent chemical structures of contaminants and develop machine learning (ML)-based predictive models for their activities and properties. Compared with the binary Morgan fingerprint (B-MF), C-MF not only indicates the presence or absence of an atom group but also quantifies its count in a molecule. We employ six different ML algorithms (ridge regression, SVM, KNN, RF, XGBoost, and CatBoost) to develop models on 10 contaminant-related data sets based on C-MF and B-MF, and compare the two representations in terms of predictive performance, interpretation, and applicability domain (AD). Our results show that C-MF outperforms B-MF in nine of the 10 data sets in terms of model predictive performance. The advantage of C-MF over B-MF depends on the ML algorithm, and the performance enhancements are proportional to the difference in the chemical diversity of the data sets calculated with B-MF and C-MF. Model interpretation results show that the C-MF-based models can elucidate the effect of atom group counts on the target and have a wider range of SHAP values. AD analysis shows that C-MF-based models have an AD similar to that of B-MF-based ones. Finally, we develop a "ContaminaNET" platform to deploy these C-MF-based models for free use.
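As an illustration, here is a minimal sketch (assuming Python with RDKit; the radius and bit-vector size are illustrative choices, not necessarily those used in the study) of the difference between B-MF and C-MF for a single molecule:

from rdkit import Chem
from rdkit.Chem import AllChem

# Example molecule (aspirin), used here only for illustration
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# B-MF: records only the presence or absence of each hashed atom environment
b_mf = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

# C-MF: additionally records how many times each hashed atom environment occurs
c_mf = AllChem.GetHashedMorganFingerprint(mol, 2, nBits=2048)

print(b_mf.GetNumOnBits())        # number of distinct environments present
print(c_mf.GetNonzeroElements())  # {bit index: count} pairs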
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data-set is a supplementary material related to the generation of synthetic images of a corridor in the University of Melbourne, Australia, from a building information model (BIM). This data-set was generated to check the ability of deep learning algorithms to learn the task of indoor localisation from synthetic images when tested on real images.

The following is the naming convention used for the data-sets. The brackets show the number of images in each data-set.

REAL DATA
Real ---------------------> Real images (949 images)
Gradmag-Real -------> Gradmag of real data (949 images)

SYNTHETIC DATA
Syn-Car ----------------> Cartoonish images (2500 images)
Syn-pho-real ----------> Synthetic photo-realistic images (2500 images)
Syn-pho-real-tex -----> Synthetic photo-realistic textured (2500 images)
Syn-Edge --------------> Edge render images (2500 images)
Gradmag-Syn-Car ---> Gradmag of Cartoonish images (2500 images)

Each folder contains the images and their respective groundtruth poses in the following format: [ImageName X Y Z w p q r].

To generate the synthetic data-set, we define a trajectory in the 3D indoor model. The points in the trajectory serve as the ground-truth poses of the synthetic images. The height of the trajectory was kept in the range of 1.5–1.8 m from the floor, which is the usual height of holding a camera in hand. Artificial point light sources were placed to illuminate the corridor (except for the Edge render images). The length of the trajectory was approximately 30 m. A virtual camera was moved along the trajectory to render four different sets of synthetic images in Blender*. The intrinsic parameters of the virtual camera were kept identical to the real camera (VGA resolution, focal length of 3.5 mm, no distortion modelled). We rendered images along the trajectory at 0.05 m intervals and ±10° tilt.

The main difference between the cartoonish (Syn-car) and photo-realistic images (Syn-pho-real) is the rendering model. Photo-realistic rendering is a physics-based model that traces the path of light rays in the scene, similar to the real world, whereas the cartoonish rendering only roughly traces the path of light rays. The photo-realistic textured images (Syn-pho-real-tex) were rendered by adding repeating synthetic textures to the 3D indoor model, such as textures of brick, carpet and wooden ceiling. The realism of the photo-realistic rendering comes at the cost of rendering time; however, the rendering times of the photo-realistic data-sets were considerably reduced with the help of a GPU. Note that the naming convention used for the data-sets (e.g. Cartoonish) follows Blender terminology.

An additional data-set (Gradmag-Syn-car) was derived from the cartoonish images by taking the edge gradient magnitude of the images and suppressing weak edges below a threshold. The edge rendered images (Syn-edge) were generated by rendering only the edges of the 3D indoor model, without taking into account the lighting conditions. This data-set is similar to the Gradmag-Syn-car data-set but does not contain the effect of illumination of the scene, such as reflections and shadows.

*Blender is an open-source 3D computer graphics software and finds its applications in video games, animated films, simulation and visual art. For more information please visit: http://www.blender.org

Please cite the papers if you use the data-set:
1) Acharya, D., Khoshelham, K., and Winter, S., 2019. BIM-PoseNet: Indoor camera localisation using a 3D indoor model and deep learning from synthetic images. ISPRS Journal of Photogrammetry and Remote Sensing, 150: 245-258.
2) Acharya, D., Singha Roy, S., Khoshelham, K. and Winter, S., 2019. Modelling uncertainty of single image indoor localisation using a 3D model and deep learning. ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, IV-2/W5, pages 247-254.
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess the status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered in developing V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (i.e., those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes feature sets extracted from GNSS-RO profiles, used for multiclass classification model training and for testing the classifier from
Dittmann, Chang, & Morton (202?), "Machine Learning Classification of Ionosphere and RFI Disturbances in Spaceborne GNSS Radio Occultation Measurements."
In this work, we apply a combination of physics-based feature engineering and data-driven supervised machine learning to improve the classification of disturbances in low Earth orbit Spire Global GNSS radio occultation measurements.
data
├── converted_labels.pkl #(feature set catalogs)
├── **.pkl
└── data
├── feature_set_all_single_file
│ └── all_fdf_v2.pkl #(6 months of feature sets concatenated into single object)
└── feature_sets
├── 2022.206.117.01.01.G23.SC001_0001.pkl #(individual profile feature sets)
├── 202***.pkl
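A minimal loading sketch (assuming Python with pandas, and assuming the .pkl files are pickled pandas objects, which is not stated above):

import pandas as pd

# Adjust the paths to your local copy of the data tree shown above.
features = pd.read_pickle("data/data/feature_set_all_single_file/all_fdf_v2.pkl")
labels = pd.read_pickle("data/converted_labels.pkl")

print(features.shape)    # 6 months of feature sets concatenated into a single object
print(features.columns)  # feature definitions are given in the accompanying paper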
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Can machine learning effectively lower the effort necessary to extract important information from raw data for hydrological research questions? Using the example of a typical water-management task, the extraction of direct-runoff flood events from continuous hydrographs, we demonstrate how machine learning can be used to automate the application of expert knowledge to big data sets and extract the relevant information. In particular, we tested seven different algorithms to detect event beginning and end solely from a given excerpt of the continuous hydrograph. First, the number of required data points within the excerpts as well as the amount of training data was determined. In a local application, we were able to show that all applied machine learning algorithms were capable of reproducing manually defined event boundaries. Automatically delineated events were afflicted with a relative duration error of 20% and a relative event volume error of 5%. Moreover, we could show that hydrograph separation patterns could easily be learned by the algorithms and are regionally and trans-regionally transferable without significant performance loss. Hence, the training data sets can be very small and trained algorithms can be applied to new catchments lacking training data. The results show the great potential of machine learning to extract relevant information efficiently and, hence, lower the effort of data preprocessing for water management studies. Moreover, the transferability of trained algorithms to other catchments is a clear advantage over common methods.
This research developed a Kencorpus Swahili Question Answering Dataset, KenSwQuAD, from raw data of the Swahili language, a low-resource language predominantly spoken in Eastern Africa that also has speakers in other parts of the world. Question answering datasets are important for machine comprehension of natural language in tasks such as internet search and dialog systems. However, before such machine learning systems can perform these tasks, they need training data such as the gold-standard Question Answering (QA) set developed in this research. The research engaged annotators to formulate question-answer pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus that collected data from three Kenyan languages. The total Swahili data collection had 2,585 texts, of which we annotated 1,445 story texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance set of 12.5% of the annotated texts was subjected to re-evaluation by different annotators, who confirmed that the QA pairs were all correctly annotated. A proof of concept applying the set to machine learning on the question answering task confirmed that the dataset can be used for such practical tasks. The research therefore developed KenSwQuAD, a question-answer dataset for Swahili that is useful to the natural language processing community, which needs training and gold-standard sets for its machine learning applications. The research also contributed to the resourcing of the Swahili language, which is important for communication around the globe. Updating this set and providing similar sets for other low-resource languages is an important area worthy of further research. Acknowledgement of annotators: Rose Felynix Nyaboke, Alice Gachachi Muchemi, Patrick Ndung'u, Eric Omundi Magutu, Henry Masinde, Naomi Muthoni Gitau, Mark Bwire Erusmo, Victor Orembe Wandera, Frankline Owino, Geoffrey Sagwe Ombui
This dataset consists of sets of images corresponding to data sets 1-8 described in Table 1 of the manuscript "Establishing a Reference Focal Plane Using Machine Learning and Beads for Brightfield Imaging". Data sets from A2K contain two .zip folders: one with the .tiff images and one with the corresponding .txt file with live and dead cell concentration enumeration. The A2K instrument software collects 4 images per acquisition, and each of those images is passed through the A2K instrument's software algorithm, which segments the live (green outline), dead (red outline), and debris (yellow outline) objects. Segmentation parameters are set by the user. This creates a total of 8 stored images per acquisition. When in proper focus and brightness, the V100 beads are segmented in green, appearing as live cells. In cases where the beads do not display the bright spot center (when out of focus or too dim), the software may segment the beads in red, as dead cells. Data sets from the Nikon contain .zip folders of .nd2 image stacks that can be opened with ImageJ. These image sets were used to develop the AI model to identify the reference focal plane as described in the associated manuscript.
This dataset contains a comparison of packet loss counts vs. handovers using four different methods: baseline, heuristic, distance, and machine learning, as well as the data used to train a machine learning model. This data was generated as a result of the work described in the paper "O-RAN with Machine Learning in ns-3" by Wesley Garey, Tanguy Ropitault, Richard Rouil, Evan Black, and Weichao Gao, presented at the 2023 Workshop on ns-3 (WNS3 2023), held June 28-29, 2023, in Arlington, VA, USA, and published by ACM, New York, NY, USA. The paper is accessible at https://doi.org/10.1145/3592149.3592157. This data set includes the data from "Figure 10: Simulation Results Comparing the Baseline with the Heuristic, Distance, and ML Approaches" and "Figure 11: Simulation Results that Depict the Impact of Increasing the Link Delay of the E2 Interface," as well as the data set used to train the machine learning model that is discussed there.
This dataset was created by Kathirmani Sukumar
It contains the following files:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains all classifications made by the Gravity Spy machine learning model for LIGO glitches from the first three observing runs (O1, O2, and O3, where O3 is split into O3a and O3b). Gravity Spy classified all noise events identified by the Omicron trigger pipeline for which the signal-to-noise ratio was above 7.5 and the peak frequency of the noise event was between 10 Hz and 2048 Hz. To classify noise events, Gravity Spy made Omega scans of every glitch at 4 different durations, which helps capture the morphology of noise events that are both short and long in duration.
There are 22 classes used for O1 and O2 data (including No_Glitch and None_of_the_Above), while two additional classes are used to classify O3 data.
For O1 and O2, the glitch classes were: 1080Lines, 1400Ripples, Air_Compressor, Blip, Chirp, Extremely_Loud, Helix, Koi_Fish, Light_Modulation, Low_Frequency_Burst, Low_Frequency_Lines, No_Glitch, None_of_the_Above, Paired_Doves, Power_Line, Repeating_Blips, Scattered_Light, Scratchy, Tomte, Violin_Mode, Wandering_Line, Whistle
For O3, the glitch classes were: 1080Lines, 1400Ripples, Air_Compressor, Blip, Blip_Low_Frequency, Chirp, Extremely_Loud, Fast_Scattering, Helix, Koi_Fish, Light_Modulation, Low_Frequency_Burst, Low_Frequency_Lines, No_Glitch, None_of_the_Above, Paired_Doves, Power_Line, Repeating_Blips, Scattered_Light, Scratchy, Tomte, Violin_Mode, Wandering_Line, Whistle
The data set is described in Glanzer et al. (2023), which we ask to be cited in any publications using this data release. Example code using the data can be found in this Colab notebook.
If you would like to download the Omega scans associated with each glitch, you can use the gravitational-wave data-analysis tool GWpy. If you would like to use this tool, please install Anaconda if you have not already, and create a virtual environment using the following command:
conda create --name gravityspy-py38 -c conda-forge python=3.8 gwpy pandas psycopg2 sqlalchemy
After downloading one of the CSV files for a specific era and interferometer, please run the following Python script if you would like to download the data associated with the metadata in the CSV file. We recommend not trying to download too many images at one time. For example, the script below will read data on Hanford glitches from O2 that were classified by Gravity Spy and filter for only glitches that were labelled as Blips with 90% confidence or higher, and then download the first 4 rows of the filtered table.
from gwpy.table import GravitySpyTable
# Read the metadata for one era and interferometer (here: Hanford, O2)
H1_O2 = GravitySpyTable.read('H1_O2.csv')
# Keep only glitches labelled as Blip with confidence above 0.9
blips = H1_O2[(H1_O2["ml_label"] == "Blip") & (H1_O2["ml_confidence"] > 0.9)]
# Download the Omega scans for the first 4 rows of the filtered table
blips[0:4].download(nproc=1)
Each of the columns in the CSV files is taken from one of several different inputs:
[‘event_time’, ‘ifo’, ‘peak_time’, ‘peak_time_ns’, ‘start_time’, ‘start_time_ns’, ‘duration’, ‘peak_frequency’, ‘central_freq’, ‘bandwidth’, ‘channel’, ‘amplitude’, ‘snr’, ‘q_value’] contain metadata about the signal from the Omicron pipeline.
[‘gravityspy_id’] is the unique identifier for each glitch in the dataset.
[‘1400Ripples’, ‘1080Lines’, ‘Air_Compressor’, ‘Blip’, ‘Chirp’, ‘Extremely_Loud’, ‘Helix’, ‘Koi_Fish’, ‘Light_Modulation’, ‘Low_Frequency_Burst’, ‘Low_Frequency_Lines’, ‘No_Glitch’, ‘None_of_the_Above’, ‘Paired_Doves’, ‘Power_Line’, ‘Repeating_Blips’, ‘Scattered_Light’, ‘Scratchy’, ‘Tomte’, ‘Violin_Mode’, ‘Wandering_Line’, ‘Whistle’] contain the machine learning confidence for a glitch being in a particular Gravity Spy class (the confidence in all these columns should sum to unity). These use the original 22 classes in all cases.
[‘ml_label’, ‘ml_confidence’] provide the machine-learning predicted label for each glitch, and the machine learning confidence in its classification.
[‘url1’, ‘url2’, ‘url3’, ‘url4’] are the links to the publicly available Omega scans for each glitch. ‘url1’ shows the glitch for a duration of 0.5 seconds, ‘url2’ for 1 second, ‘url3’ for 2 seconds, and ‘url4’ for 4 seconds.
For the most recently uploaded training set used in Gravity Spy machine learning algorithms, please see Gravity Spy Training Set on Zenodo.
For detailed information on the training set used for the original Gravity Spy machine learning paper, please see Machine learning for Gravity Spy: Glitch classification and dataset on Zenodo.
https://academictorrents.com/nolicensespecified
The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets. As an indication of the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited "papers" in all of computer science. The current version of the web site was designed in 2007 by Arthur Asuncion and David Newman, and this project is in collaboration with Rexa.info at the University of Massachusetts Amherst. Funding support from the National Science Foundation is gratefully acknowledged. Many people deserve thanks for making the repository a success. Foremost among them are the donors and creators of the databases and data generators.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The work involved in developing the dataset and benchmarking its use for machine learning is set out in the article "IoMT-TrafficData: Dataset and Tools for Benchmarking Intrusion Detection in Internet of Medical Things", DOI: 10.1109/ACCESS.2024.3437214.
Please cite the aforementioned article when using this dataset.
The increasing importance of securing the Internet of Medical Things (IoMT), due to its vulnerability to cyber-attacks, highlights the need for an effective intrusion detection system (IDS). In this study, our main objective was to develop a machine learning model for the IoMT to enhance the security of medical devices and protect patients' private data. To address this issue, we built a scenario that utilised Internet of Things (IoT) and IoMT devices to simulate real-world attacks. We collected, cleaned, and pre-processed the data, and fed it into our machine-learning model to detect intrusions in the network. Our results revealed significant improvements in all performance metrics, indicating robustness and reproducibility in real-world scenarios. This research has implications in the context of IoMT and cybersecurity, as it helps mitigate vulnerabilities and lower the number of breaches occurring with the rapid growth of IoMT devices. The use of machine learning algorithms for intrusion detection systems is essential, and our study provides valuable insights and a road map for future research and the deployment of such systems in live environments. By implementing our findings, we can contribute to a safer and more secure IoMT ecosystem, safeguarding patient privacy and ensuring the integrity of medical data.
The ZIP folder comprises two main components: Captures and Datasets. Within the captures folder, we have included all the captures used in this project. These captures are organized into separate folders corresponding to the type of network analysis: BLE or IP-Based. Similarly, the datasets folder follows a similar organizational approach. It contains datasets categorized by type: BLE, IP-Based Packet, and IP-Based Flows.
To cater to diverse analytical needs, the datasets are provided in two formats: CSV (Comma-Separated Values) and pickle. The CSV format facilitates seamless integration with various data analysis tools, while the pickle format preserves the intricate structures and relationships within the dataset.
This organization enables researchers to easily locate and utilize the specific captures and datasets they require, based on their preferred network analysis type or dataset type. The availability of different formats further enhances the flexibility and usability of the provided data.
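As an illustration, a minimal sketch (assuming Python with pandas; the file names below are placeholders, not the actual file names in the ZIP) of loading either format:

import pandas as pd

# Placeholder file names; substitute the actual CSV/pickle files from the Datasets folder.
flows_csv = pd.read_csv("datasets/ip_based_flows.csv")
flows_pkl = pd.read_pickle("datasets/ip_based_flows.pkl")

# Both formats hold the same records; the pickle additionally preserves dtypes and nested structures.
print(flows_csv.shape, flows_pkl.shape)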
Within this dataset, three sub-datasets are available, namely BLE, IP-Based Packet, and IP-Based Flows. Below is a table of the features selected for each dataset and consequently used in the evaluation model within the provided work.
Identified Key Features Within Bluetooth Dataset
Feature | Meaning |
btle.advertising_header | BLE Advertising Packet Header |
btle.advertising_header.ch_sel | BLE Advertising Channel Selection Algorithm |
btle.advertising_header.length | BLE Advertising Length |
btle.advertising_header.pdu_type | BLE Advertising PDU Type |
btle.advertising_header.randomized_rx | BLE Advertising Rx Address |
btle.advertising_header.randomized_tx | BLE Advertising Tx Address |
btle.advertising_header.rfu.1 | Reserved for Future Use 1 |
btle.advertising_header.rfu.2 | Reserved for Future Use 2 |
btle.advertising_header.rfu.3 | Reserved for Future Use 3 |
btle.advertising_header.rfu.4 | Reserved for Future Use 4 |
btle.control.instant | Instant Value Within a BLE Control Packet |
btle.crc.incorrect | Incorrect CRC |
btle.extended_advertising | Advertiser Data Information |
btle.extended_advertising.did | Advertiser Data Identifier |
btle.extended_advertising.sid | Advertiser Set Identifier |
btle.length | BLE Length |
frame.cap_len | Frame Length Stored Into the Capture File |
frame.interface_id | Interface ID |
frame.len | Frame Length Wire |
nordic_ble.board_id | Board ID |
nordic_ble.channel | Channel Index |
nordic_ble.crcok | Indicates if CRC is Correct |
nordic_ble.flags | Flags |
nordic_ble.packet_counter | Packet Counter |
nordic_ble.packet_time | Packet time (start to end) |
nordic_ble.phy | PHY |
nordic_ble.protover | Protocol Version |
Identified Key Features Within IP-Based Packets Dataset
Feature | Meaning |
http.content_length | Length of content in an HTTP response |
http.request | HTTP request being made |
http.response.code | HTTP response status code |
http.response_number | Sequential number of an HTTP response |
http.time | Time taken for an HTTP transaction |
tcp.analysis.initial_rtt | Initial round-trip time for TCP connection |
tcp.connection.fin | TCP connection termination with a FIN flag |
tcp.connection.syn | TCP connection initiation with SYN flag |
tcp.connection.synack | TCP connection establishment with SYN-ACK flags |
tcp.flags.cwr | Congestion Window Reduced flag in TCP |
tcp.flags.ecn | Explicit Congestion Notification flag in TCP |
tcp.flags.fin | FIN flag in TCP |
tcp.flags.ns | Nonce Sum flag in TCP |
tcp.flags.res | Reserved flags in TCP |
tcp.flags.syn | SYN flag in TCP |
tcp.flags.urg | Urgent flag in TCP |
tcp.urgent_pointer | Pointer to urgent data in TCP |
ip.frag_offset | Fragment offset in IP packets |
eth.dst.ig | IG bit of the Ethernet destination address (individual or group address) |
eth.src.ig | IG bit of the Ethernet source address (individual or group address) |
eth.src.lg | LG bit of the Ethernet source address (locally or globally administered) |
eth.src_not_group | Ethernet source is not a group address |
arp.isannouncement | Indicates if an ARP message is an announcement |
Identified Key Features Within IP-Based Flows Dataset
Feature | Meaning |
proto | Transport layer protocol of the connection |
service | Identification of an application protocol |
orig_bytes | Originator payload bytes |
resp_bytes | Responder payload bytes |
history | Connection state history |
orig_pkts | Originator sent packets |
resp_pkts | Responder sent packets |
flow_duration | Length of the flow in seconds |
fwd_pkts_tot | Forward packets total |
bwd_pkts_tot | Backward packets total |
fwd_data_pkts_tot | Forward data packets total |
bwd_data_pkts_tot | Backward data packets total |
fwd_pkts_per_sec | Forward packets per second |
bwd_pkts_per_sec | Backward packets per second |
flow_pkts_per_sec | Flow packets per second |
fwd_header_size | Forward header bytes |
bwd_header_size | Backward header bytes |
fwd_pkts_payload | Forward payload bytes |
bwd_pkts_payload | Backward payload bytes |
flow_pkts_payload | Flow payload bytes |
fwd_iat | Forward inter-arrival time |
bwd_iat | Backward inter-arrival time |
flow_iat | Flow inter-arrival time |
active | Flow active duration |
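For instance, a minimal sketch (assuming Python with pandas and scikit-learn; the CSV path and label column are hypothetical, and this is not the evaluation model from the article) of training a classifier on some of the flow features listed above:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical path and label column; replace with the actual flow dataset and its label field.
df = pd.read_csv("datasets/ip_based_flows.csv")
features = ["flow_duration", "fwd_pkts_tot", "bwd_pkts_tot", "fwd_pkts_per_sec",
            "bwd_pkts_per_sec", "flow_pkts_per_sec", "fwd_header_size", "bwd_header_size"]
X = df[features]
y = df["label"]  # assumed column indicating benign vs. attack traffic

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))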
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
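To make the APP concrete, here is a minimal sketch (assuming Python with numpy; this is an independent illustration, not the Julia implementation referenced below) of drawing one APP sample for a data set with a given number of classes:

import numpy as np

def draw_app_sample(labels, sample_size, n_classes, rng):
    # APP: draw a class prevalence vector uniformly from the probability simplex,
    # then sample items (with replacement here, for simplicity) according to it.
    prevalences = rng.dirichlet(np.ones(n_classes))
    drawn = []
    for c, p in enumerate(prevalences):
        candidates = np.flatnonzero(labels == c)
        n_c = int(round(p * sample_size))
        if n_c > 0 and len(candidates) > 0:
            drawn.append(rng.choice(candidates, size=n_c, replace=True))
    return np.concatenate(drawn) if drawn else np.array([], dtype=int)

# Toy usage: 1,000 items with 5 ordinal classes
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=1000)
sample_indices = draw_app_sample(labels, sample_size=100, n_classes=5, rng=rng)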
Usage
You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
or k-means.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please note that the file msl-labeled-data-set-v2.1.zip below contains the latest images and labels associated with this data set.
Data Set Description
The data set consists of 6,820 images that were collected by the Mars Science Laboratory (MSL) Curiosity Rover using three instruments: (1) the Mast Camera (Mastcam) Left Eye; (2) the Mast Camera Right Eye; (3) the Mars Hand Lens Imager (MAHLI). With help from Dr. Raymond Francis, a member of the MSL operations team, we identified 19 classes of science and engineering interest (see the "Classes" section for more information), and each image is assigned one class label. We split the data set into training, validation, and test sets in order to train and evaluate machine learning algorithms. The training set contains 5,920 images (including augmented images; see the "Image Augmentation" section for more information); the validation set contains 300 images; the test set contains 600 images. The training set images were randomly sampled from sol (Martian day) range 1 - 948; validation set images were randomly sampled from sol range 949 - 1920; test set images were randomly sampled from sol range 1921 - 2224. All images are resized to 227 x 227 pixels without preserving the original height/width aspect ratio.
Directory Contents
The label files are formatted as below:
"Image-file-name class_in_integer_representation"
Labeling Process
Each image was labeled with help from three different volunteers (see Contributor list). The final labels are determined using the following processes:
Classes
There are 19 classes identified in this data set. In order to simplify our training and evaluation algorithms, we mapped the class names from string to integer representations. The names of classes, string-integer mappings, distributions are shown below:
Class name, counts (training set), counts (validation set), counts (test set), integer representation
Arm cover, 10, 1, 4, 0
Other rover part, 190, 11, 10, 1
Artifact, 680, 62, 132, 2
Nearby surface, 1554, 74, 187, 3
Close-up rock, 1422, 50, 84, 4
DRT, 8, 4, 6, 5
DRT spot, 214, 1, 7, 6
Distant landscape, 342, 14, 34, 7
Drill hole, 252, 5, 12, 8
Night sky, 40, 3, 4, 9
Float, 190, 5, 1, 10
Layers, 182, 21, 17, 11
Light-toned veins, 42, 4, 27, 12
Mastcam cal target, 122, 12, 29, 13
Sand, 228, 19, 16, 14
Sun, 182, 5, 19, 15
Wheel, 212, 5, 5, 16
Wheel joint, 62, 1, 5, 17
Wheel tracks, 26, 3, 1, 18
Image Augmentation
Only the training set contains augmented images. 3,920 of the 5,920 images in the training set are augmented versions of the remaining 2,000 original training images. Images taken by different instruments were augmented differently. As shown below, we employed 5 different methods to augment images. Images taken by the Mastcam left and right eye cameras were augmented using a horizontal flipping method, and images taken by the MAHLI camera were augmented using all 5 methods. Note that one can filter based on the file names listed in the train-set.txt file to obtain a set of non-augmented images.
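As one example of these augmentations, a minimal sketch (assuming Python with Pillow; this is not the authors' exact pipeline, and the file names are placeholders) of the horizontal flipping method applied to Mastcam images:

from PIL import Image, ImageOps

# Horizontal flip, the augmentation method named above for Mastcam images.
img = Image.open("example.jpg")        # placeholder input image
flipped = ImageOps.mirror(img)         # flip left-right
flipped.save("example_flipped.jpg")    # placeholder output name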
Acknowledgment
The authors would like to thank the volunteers (listed as Contributors) who provided annotations for this data set. We would also like to thank the PDS Imaging Node for its continuous support of this work.
To develop a simulation that collects both visual information and grasp information about different objects using a multi-fingered hand. These sources of data can be used in the future to learn integrated object-action grasp representations.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The BoolQ dataset is a valuable resource crafted for question answering tasks. It is organised into two main splits: a validation split and a training split. The primary aim of this dataset is to facilitate research in natural language processing (NLP) and machine learning (ML), particularly in tasks involving the answering of questions based on provided text. It offers a rich collection of user-posed questions, their corresponding answers, and the passages from which these answers are derived. This enables researchers to develop and evaluate models for real-world scenarios where information needs to be retrieved or understood from textual sources.
The answer column contains boolean values, with true appearing 5,874 times (62%) and false appearing 3,553 times (38%). The BoolQ dataset consists of two main parts: a validation split and a training split. Both splits feature consistent data fields: question, answer, and passage. The train.csv file, for example, is part of the training data. While specific row or record counts are not detailed for the entire dataset, the answer column contains 9,427 boolean values in total.
This dataset is ideally suited for: * Question Answering Systems: Training models to identify correct answers from multiple choices, given a question and a passage. * Machine Reading Comprehension: Developing models that can understand and interpret written text effectively. * Information Retrieval: Enabling models to retrieve relevant passages or documents that contain answers to a given query or question.
The sources do not specify the geographic, time range, or demographic scope of the data.
CC0
The BoolQ dataset is primarily intended for researchers and developers working in artificial intelligence fields such as Natural Language Processing (NLP) and Machine Learning (ML). It is particularly useful for those building or evaluating: * Question answering algorithms * Information retrieval systems * Machine reading comprehension models
Original Data Source: BoolQ - Question-Answer-Passage Consistency
This dataset contains a set of data files used as input for a World Bank research project (an empirical comparative assessment of machine learning algorithms applied to poverty prediction). The objective of the project was to compare the performance of a series of classification algorithms. The dataset contains variables at the household, individual, and community levels. The variables selected to serve as potential predictors in the machine learning models are all qualitative variables (except for the household size). Information on household consumption is included, but in the form of dummy variables (indicating whether or not the household consumed each specific product or service listed in the survey questionnaire). The household-level data file contains the variable "Poor / Non-poor", which served as the predicted variable ("label") in the models.
One of the data files included in the dataset contains data on household consumption (amounts) by main categories of products and services. This data file was not used in the prediction model; it is used only for the purpose of analyzing the models' mis-classifications (in particular, to identify how far the mis-classified households are from the national poverty line).
These datasets are provided to allow interested users to replicate the analysis done for the project using Python 3 (a collection of Jupyter Notebooks containing the documented scripts is openly available on GitHub).
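As an illustration of the kind of comparison the project performed, a minimal sketch (assuming Python 3 with pandas and scikit-learn; the file name, label column, and algorithms shown are placeholders, not the project's actual scripts):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical household-level file with dummy-coded predictors and a "poor" label column.
df = pd.read_csv("household_level.csv")
y = df["poor"]                   # assumed 1 = poor, 0 = non-poor
X = df.drop(columns=["poor"])    # qualitative predictors (dummies) plus household size

# Compare a few classification algorithms with cross-validated accuracy.
for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(n_estimators=200))]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())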
National
Sample survey data [ssd]
The IHS3 sampling frame is based on the listing information and cartography from the 2008 Malawi Population and Housing Census (PHC); includes the three major regions of Malawi, namely North, Center and South; and is stratified into rural and urban strata. The urban strata include the four major urban areas: Lilongwe City, Blantyre City, Mzuzu City, and the Municipality of Zomba. All other areas are considered as rural areas, and each of the 27 districts were considered as a separate sub-stratum as part of the main rural stratum. It was decided to exclude the island district of Likoma from the IHS3 sampling frame, since it only represents about 0.1% of the population of Malawi, and the corresponding cost of enumeration would be relatively high. The sampling frame further excludes the population living in institutions, such as hospitals, prisons and military barracks. Hence, the IHS3 strata are composed of 31 districts in Malawi.
A stratified two-stage sample design was used for the IHS3.
Face-to-face [f2f]
The survey was collected using four questionnaires: 1) Household Questionnaire; 2) Agriculture Questionnaire; 3) Fishery Questionnaire; 4) Community Questionnaire.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The image files were scaled and modified to represent a training data set. They can be used to detect and identify object type based on the material type in the image. In this process, both a training data set and a test data set can be generated from these image files.