Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This data set is uploaded as supporting information for the publication entitled: "Effect of data preprocessing and machine learning hyperparameters on mass spectrometry imaging models".

Files are as follows:
polymer_microarray_data.mat - MATLAB workspace file containing peak-picked ToF-SIMS data (hyperspectral array) for the polymer microarray sample.
nylon_data.mat - MATLAB workspace file containing m/z binned ToF-SIMS data (hyperspectral array) for the semi-synthetic nylon data set, generated from 7 nylon samples.

Additional details about the datasets can be found in the published article. If you use this data set in your work, please cite our work as follows:
Cite as: Gardner et al., J. Vac. Sci. Technol. A 41, 000000 (2023); doi: 10.1116/6.0002788
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In this study, we introduce the count-based Morgan fingerprint (C-MF) to represent chemical structures of contaminants and develop machine learning (ML)-based predictive models for their activities and properties. Compared with the binary Morgan fingerprint (B-MF), C-MF not only indicates the presence or absence of an atom group but also quantifies its count in a molecule. We employ six different ML algorithms (ridge regression, SVM, KNN, RF, XGBoost, and CatBoost) to develop models on 10 contaminant-related data sets based on C-MF and B-MF, and compare the two representations in terms of predictive performance, interpretation, and applicability domain (AD). Our results show that C-MF outperforms B-MF in nine of the 10 data sets in terms of model predictive performance. The advantage of C-MF over B-MF depends on the ML algorithm, and the performance enhancements are proportional to the difference in the chemical diversity of the data sets calculated with B-MF and C-MF. Model interpretation results show that the C-MF-based models can elucidate the effect of atom group counts on the target and have a wider range of SHAP values. AD analysis shows that C-MF-based models have an AD similar to that of B-MF-based ones. Finally, we develop a "ContaminaNET" platform to deploy these C-MF-based models for free use.
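As an illustration, here is a minimal sketch (assuming Python with RDKit; the radius and bit-vector size are illustrative choices, not necessarily those used in the study) of the difference between B-MF and C-MF for a single molecule:

from rdkit import Chem
from rdkit.Chem import AllChem

# Example molecule (aspirin), used here only for illustration
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# B-MF: records only the presence or absence of each hashed atom environment
b_mf = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

# C-MF: additionally records how many times each hashed atom environment occurs
c_mf = AllChem.GetHashedMorganFingerprint(mol, 2, nBits=2048)

print(b_mf.GetNumOnBits())        # number of distinct environments present
print(c_mf.GetNonzeroElements())  # {bit index: count} pairs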
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data-set is a supplementary material related to the generation of synthetic images of a corridor in the University of Melbourne, Australia, from a building information model (BIM). This data-set was generated to check the ability of deep learning algorithms to learn the task of indoor localisation from synthetic images when tested on real images.

The following is the naming convention used for the data-sets. The brackets show the number of images in each data-set.

REAL DATA
Real ---------------------> Real images (949 images)
Gradmag-Real -------> Gradmag of real data (949 images)

SYNTHETIC DATA
Syn-Car ----------------> Cartoonish images (2500 images)
Syn-pho-real ----------> Synthetic photo-realistic images (2500 images)
Syn-pho-real-tex -----> Synthetic photo-realistic textured (2500 images)
Syn-Edge --------------> Edge render images (2500 images)
Gradmag-Syn-Car ---> Gradmag of Cartoonish images (2500 images)

Each folder contains the images and their respective groundtruth poses in the following format: [ImageName X Y Z w p q r].

To generate the synthetic data-set, we define a trajectory in the 3D indoor model. The points in the trajectory serve as the ground-truth poses of the synthetic images. The height of the trajectory was kept in the range of 1.5–1.8 m from the floor, which is the usual height of holding a camera in hand. Artificial point light sources were placed to illuminate the corridor (except for the Edge render images). The length of the trajectory was approximately 30 m. A virtual camera was moved along the trajectory to render four different sets of synthetic images in Blender*. The intrinsic parameters of the virtual camera were kept identical to the real camera (VGA resolution, focal length of 3.5 mm, no distortion modelled). We rendered images along the trajectory at 0.05 m intervals and ±10° tilt.

The main difference between the cartoonish (Syn-car) and photo-realistic images (Syn-pho-real) is the rendering model. Photo-realistic rendering is a physics-based model that traces the path of light rays in the scene, similar to the real world, whereas the cartoonish rendering only roughly traces the path of light rays. The photo-realistic textured images (Syn-pho-real-tex) were rendered by adding repeating synthetic textures to the 3D indoor model, such as textures of brick, carpet and wooden ceiling. The realism of the photo-realistic rendering comes at the cost of rendering time; however, the rendering times of the photo-realistic data-sets were considerably reduced with the help of a GPU. Note that the naming convention used for the data-sets (e.g. Cartoonish) follows Blender terminology.

An additional data-set (Gradmag-Syn-car) was derived from the cartoonish images by taking the edge gradient magnitude of the images and suppressing weak edges below a threshold. The edge rendered images (Syn-edge) were generated by rendering only the edges of the 3D indoor model, without taking into account the lighting conditions. This data-set is similar to the Gradmag-Syn-car data-set but does not contain the effect of illumination of the scene, such as reflections and shadows.

*Blender is an open-source 3D computer graphics software and finds its applications in video games, animated films, simulation and visual art. For more information please visit: http://www.blender.org

Please cite the papers if you use the data-set:
1) Acharya, D., Khoshelham, K., and Winter, S., 2019. BIM-PoseNet: Indoor camera localisation using a 3D indoor model and deep learning from synthetic images. ISPRS Journal of Photogrammetry and Remote Sensing, 150: 245-258.
2) Acharya, D., Singha Roy, S., Khoshelham, K. and Winter, S., 2019. Modelling uncertainty of single image indoor localisation using a 3D model and deep learning. ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, IV-2/W5, pages 247-254.
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess the status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered in developing V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (i.e., those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes feature sets extracted from GNSS-RO profiles, used for multiclass classification model training and for testing the classifier from
Dittmann, Chang, & Morton (202?), "Machine Learning Classification of Ionosphere and RFI Disturbances in Spaceborne GNSS Radio Occultation Measurements."
In this work, we apply a combination of physics-based feature engineering and data-driven supervised machine learning to improve the classification of disturbances in low Earth orbit Spire Global GNSS radio occultation measurements.
data
├── converted_labels.pkl #(feature set catalogs)
├── **.pkl
└── data
├── feature_set_all_single_file
│ └── all_fdf_v2.pkl #(6 months of feature sets concatenated into single object)
└── feature_sets
├── 2022.206.117.01.01.G23.SC001_0001.pkl #(individual profile feature sets)
├── 202***.pkl
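A minimal loading sketch (assuming Python with pandas, and assuming the .pkl files are pickled pandas objects, which is not stated above):

import pandas as pd

# Adjust the paths to your local copy of the data tree shown above.
features = pd.read_pickle("data/data/feature_set_all_single_file/all_fdf_v2.pkl")
labels = pd.read_pickle("data/converted_labels.pkl")

print(features.shape)    # 6 months of feature sets concatenated into a single object
print(features.columns)  # feature definitions are given in the accompanying paper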
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Can machine learning effectively lower the effort necessary to extract important information from raw data for hydrological research questions? Using the example of a typical water-management task, the extraction of direct-runoff flood events from continuous hydrographs, we demonstrate how machine learning can be used to automate the application of expert knowledge to big data sets and extract the relevant information. In particular, we tested seven different algorithms to detect event beginning and end solely from a given excerpt of the continuous hydrograph. First, the number of required data points within the excerpts as well as the amount of training data was determined. In a local application, we were able to show that all applied machine learning algorithms were capable of reproducing manually defined event boundaries. Automatically delineated events were afflicted with a relative duration error of 20% and a relative event volume error of 5%. Moreover, we could show that hydrograph separation patterns could easily be learned by the algorithms and are regionally and trans-regionally transferable without significant performance loss. Hence, the training data sets can be very small and trained algorithms can be applied to new catchments lacking training data. The results show the great potential of machine learning to extract relevant information efficiently and, hence, lower the effort of data preprocessing for water management studies. Moreover, the transferability of trained algorithms to other catchments is a clear advantage over common methods.
This research developed a Kencorpus Swahili Question Answering Dataset, KenSwQuAD, from raw data of the Swahili language, a low-resource language predominantly spoken in Eastern Africa that also has speakers in other parts of the world. Question answering datasets are important for machine comprehension of natural language in tasks such as internet search and dialog systems. However, before such machine learning systems can perform these tasks, they need training data such as the gold-standard Question Answering (QA) set developed in this research. The research engaged annotators to formulate question-answer pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus that collected data from three Kenyan languages. The total Swahili data collection had 2,585 texts, of which we annotated 1,445 story texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance set of 12.5% of the annotated texts was subjected to re-evaluation by different annotators, who confirmed that the QA pairs were all correctly annotated. A proof of concept applying the set to machine learning on the question answering task confirmed that the dataset can be used for such practical tasks. The research therefore developed KenSwQuAD, a question-answer dataset for Swahili that is useful to the natural language processing community, which needs training and gold-standard sets for its machine learning applications. The research also contributed to the resourcing of the Swahili language, which is important for communication around the globe. Updating this set and providing similar sets for other low-resource languages is an important area worthy of further research. Acknowledgement of annotators: Rose Felynix Nyaboke, Alice Gachachi Muchemi, Patrick Ndung'u, Eric Omundi Magutu, Henry Masinde, Naomi Muthoni Gitau, Mark Bwire Erusmo, Victor Orembe Wandera, Frankline Owino, Geoffrey Sagwe Ombui
This dataset consists of sets of images corresponding to data sets 1-8 described in Table 1 of the manuscript "Establishing a Reference Focal Plane Using Machine Learning and Beads for Brightfield Imaging". Data sets from A2K contain two .zip folders: one with the .tiff images and one with the corresponding .txt file with live and dead cell concentration enumeration. The A2K instrument software collects 4 images per acquisition, and each of those images is passed through the A2K instrument's software algorithm, which segments the live (green outline), dead (red outline), and debris (yellow outline) objects. Segmentation parameters are set by the user. This creates a total of 8 stored images per acquisition. When in proper focus and brightness, the V100 beads are segmented in green, appearing as live cells. In cases where the beads do not display the bright spot center (when out of focus or too dim), the software may segment the beads in red, as dead cells. Data sets from the Nikon contain .zip folders of .nd2 image stacks that can be opened with ImageJ. These image sets were used to develop the AI model to identify the reference focal plane as described in the associated manuscript.
This dataset contains a comparison of packet loss counts vs. handovers using four different methods: baseline, heuristic, distance, and machine learning, as well as the data used to train a machine learning model. This data was generated as a result of the work described in the paper "O-RAN with Machine Learning in ns-3" by Wesley Garey, Tanguy Ropitault, Richard Rouil, Evan Black, and Weichao Gao, presented at the 2023 Workshop on ns-3 (WNS3 2023), held June 28-29, 2023, in Arlington, VA, USA, and published by ACM, New York, NY, USA. The paper is accessible at https://doi.org/10.1145/3592149.3592157. This data set includes the data from "Figure 10: Simulation Results Comparing the Baseline with the Heuristic, Distance, and ML Approaches" and "Figure 11: Simulation Results that Depict the Impact of Increasing the Link Delay of the E2 Interface," as well as the data set used to train the machine learning model that is discussed there.
This dataset was created by Kathirmani Sukumar
It contains the following files:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains all classifications made by the Gravity Spy machine learning model for LIGO glitches from the first three observing runs (O1, O2, and O3, where O3 is split into O3a and O3b). Gravity Spy classified all noise events identified by the Omicron trigger pipeline for which the signal-to-noise ratio was above 7.5 and the peak frequency of the noise event was between 10 Hz and 2048 Hz. To classify noise events, Gravity Spy made Omega scans of every glitch at 4 different durations, which helps capture the morphology of noise events that are both short and long in duration.
There are 22 classes used for O1 and O2 data (including No_Glitch and None_of_the_Above), while two additional classes are used to classify O3 data.
For O1 and O2, the glitch classes were: 1080Lines, 1400Ripples, Air_Compressor, Blip, Chirp, Extremely_Loud, Helix, Koi_Fish, Light_Modulation, Low_Frequency_Burst, Low_Frequency_Lines, No_Glitch, None_of_the_Above, Paired_Doves, Power_Line, Repeating_Blips, Scattered_Light, Scratchy, Tomte, Violin_Mode, Wandering_Line, Whistle
For O3, the glitch classes were: 1080Lines, 1400Ripples, Air_Compressor, Blip, Blip_Low_Frequency, Chirp, Extremely_Loud, Fast_Scattering, Helix, Koi_Fish, Light_Modulation, Low_Frequency_Burst, Low_Frequency_Lines, No_Glitch, None_of_the_Above, Paired_Doves, Power_Line, Repeating_Blips, Scattered_Light, Scratchy, Tomte, Violin_Mode, Wandering_Line, Whistle
The data set is described in Glanzer et al. (2023), which we ask to be cited in any publications using this data release. Example code using the data can be found in this Colab notebook.
If you would like to download the Omega scans associated with each glitch, you can use the gravitational-wave data-analysis tool GWpy. If you would like to use this tool, please install Anaconda if you have not already, and create a virtual environment using the following command:
conda create --name gravityspy-py38 -c conda-forge python=3.8 gwpy pandas psycopg2 sqlalchemy
After downloading one of the CSV files for a specific era and interferometer, please run the following Python script if you would like to download the data associated with the metadata in the CSV file. We recommend not trying to download too many images at one time. For example, the script below will read data on Hanford glitches from O2 that were classified by Gravity Spy and filter for only glitches that were labelled as Blips with 90% confidence or higher, and then download the first 4 rows of the filtered table.
from gwpy.table import GravitySpyTable
# Read the metadata for one era and interferometer (here: Hanford, O2)
H1_O2 = GravitySpyTable.read('H1_O2.csv')
# Keep only glitches labelled as Blip with confidence above 0.9
blips = H1_O2[(H1_O2["ml_label"] == "Blip") & (H1_O2["ml_confidence"] > 0.9)]
# Download the Omega scans for the first 4 rows of the filtered table
blips[0:4].download(nproc=1)
Each of the columns in the CSV files is taken from one of several different inputs:
[‘event_time’, ‘ifo’, ‘peak_time’, ‘peak_time_ns’, ‘start_time’, ‘start_time_ns’, ‘duration’, ‘peak_frequency’, ‘central_freq’, ‘bandwidth’, ‘channel’, ‘amplitude’, ‘snr’, ‘q_value’] contain metadata about the signal from the Omicron pipeline.
[‘gravityspy_id’] is the unique identifier for each glitch in the dataset.
[‘1400Ripples’, ‘1080Lines’, ‘Air_Compressor’, ‘Blip’, ‘Chirp’, ‘Extremely_Loud’, ‘Helix’, ‘Koi_Fish’, ‘Light_Modulation’, ‘Low_Frequency_Burst’, ‘Low_Frequency_Lines’, ‘No_Glitch’, ‘None_of_the_Above’, ‘Paired_Doves’, ‘Power_Line’, ‘Repeating_Blips’, ‘Scattered_Light’, ‘Scratchy’, ‘Tomte’, ‘Violin_Mode’, ‘Wandering_Line’, ‘Whistle’] contain the machine learning confidence for a glitch being in a particular Gravity Spy class (the confidence in all these columns should sum to unity). These use the original 22 classes in all cases.
[‘ml_label’, ‘ml_confidence’] provide the machine-learning predicted label for each glitch, and the machine learning confidence in its classification.
[‘url1’, ‘url2’, ‘url3’, ‘url4’] are the links to the publicly available Omega scans for each glitch. ‘url1’ shows the glitch for a duration of 0.5 seconds, ‘url2’ for 1 second, ‘url3’ for 2 seconds, and ‘url4’ for 4 seconds.
For the most recently uploaded training set used in Gravity Spy machine learning algorithms, please see Gravity Spy Training Set on Zenodo.
For detailed information on the training set used for the original Gravity Spy machine learning paper, please see Machine learning for Gravity Spy: Glitch classification and dataset on Zenodo.
https://academictorrents.com/nolicensespecified
The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets. As an indication of the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited "papers" in all of computer science. The current version of the web site was designed in 2007 by Arthur Asuncion and David Newman, and this project is in collaboration with Rexa.info at the University of Massachusetts Amherst. Funding support from the National Science Foundation is gratefully acknowledged. Many people deserve thanks for making the repository a success. Foremost among them are the donors and creators of the databases and data generators.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The work involved in developing the dataset and benchmarking its use for machine learning is set out in the article "IoMT-TrafficData: Dataset and Tools for Benchmarking Intrusion Detection in Internet of Medical Things", DOI: 10.1109/ACCESS.2024.3437214.
Please cite the aforementioned article when using this dataset.
The increasing importance of securing the Internet of Medical Things (IoMT), due to its vulnerability to cyber-attacks, highlights the need for an effective intrusion detection system (IDS). In this study, our main objective was to develop a machine learning model for the IoMT to enhance the security of medical devices and protect patients' private data. To address this issue, we built a scenario that utilised Internet of Things (IoT) and IoMT devices to simulate real-world attacks. We collected, cleaned, and pre-processed the data, and fed it into our machine-learning model to detect intrusions in the network. Our results revealed significant improvements in all performance metrics, indicating robustness and reproducibility in real-world scenarios. This research has implications in the context of IoMT and cybersecurity, as it helps mitigate vulnerabilities and lower the number of breaches occurring with the rapid growth of IoMT devices. The use of machine learning algorithms for intrusion detection systems is essential, and our study provides valuable insights and a road map for future research and the deployment of such systems in live environments. By implementing our findings, we can contribute to a safer and more secure IoMT ecosystem, safeguarding patient privacy and ensuring the integrity of medical data.
The ZIP folder comprises two main components: Captures and Datasets. Within the captures folder, we have included all the captures used in this project. These captures are organized into separate folders corresponding to the type of network analysis: BLE or IP-Based. Similarly, the datasets folder follows a similar organizational approach. It contains datasets categorized by type: BLE, IP-Based Packet, and IP-Based Flows.
To cater to diverse analytical needs, the datasets are provided in two formats: CSV (Comma-Separated Values) and pickle. The CSV format facilitates seamless integration with various data analysis tools, while the pickle format preserves the intricate structures and relationships within the dataset.
This organization enables researchers to easily locate and utilize the specific captures and datasets they require, based on their preferred network analysis type or dataset type. The availability of different formats further enhances the flexibility and usability of the provided data.
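As an illustration, a minimal sketch (assuming Python with pandas; the file names below are placeholders, not the actual file names in the ZIP) of loading either format:

import pandas as pd

# Placeholder file names; substitute the actual CSV/pickle files from the Datasets folder.
flows_csv = pd.read_csv("datasets/ip_based_flows.csv")
flows_pkl = pd.read_pickle("datasets/ip_based_flows.pkl")

# Both formats hold the same records; the pickle additionally preserves dtypes and nested structures.
print(flows_csv.shape, flows_pkl.shape)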
Within this dataset, three sub-datasets are available, namely BLE, IP-Based Packet, and IP-Based Flows. Below is a table of the features selected for each dataset and consequently used in the evaluation model within the provided work.
Identified Key Features Within Bluetooth Dataset
Feature | Meaning |
btle.advertising_header | BLE Advertising Packet Header |
btle.advertising_header.ch_sel | BLE Advertising Channel Selection Algorithm |
btle.advertising_header.length | BLE Advertising Length |
btle.advertising_header.pdu_type | BLE Advertising PDU Type |
btle.advertising_header.randomized_rx | BLE Advertising Rx Address |
btle.advertising_header.randomized_tx | BLE Advertising Tx Address |
btle.advertising_header.rfu.1 | Reserved for Future Use 1 |
btle.advertising_header.rfu.2 | Reserved for Future Use 2 |
btle.advertising_header.rfu.3 | Reserved for Future Use 3 |
btle.advertising_header.rfu.4 | Reserved for Future Use 4 |
btle.control.instant | Instant Value Within a BLE Control Packet |
btle.crc.incorrect | Incorrect CRC |
btle.extended_advertising | Advertiser Data Information |
btle.extended_advertising.did | Advertiser Data Identifier |
btle.extended_advertising.sid | Advertiser Set Identifier |
btle.length | BLE Length |
frame.cap_len | Frame Length Stored Into the Capture File |
frame.interface_id | Interface ID |
frame.len | Frame Length Wire |
nordic_ble.board_id | Board ID |
nordic_ble.channel | Channel Index |
nordic_ble.crcok | Indicates if CRC is Correct |
nordic_ble.flags | Flags |
nordic_ble.packet_counter | Packet Counter |
nordic_ble.packet_time | Packet time (start to end) |
nordic_ble.phy | PHY |
nordic_ble.protover | Protocol Version |
Identified Key Features Within IP-Based Packets Dataset
Feature | Meaning |
http.content_length | Length of content in an HTTP response |
http.request | HTTP request being made |
http.response.code | HTTP response status code |
http.response_number | Sequential number of an HTTP response |
http.time | Time taken for an HTTP transaction |
tcp.analysis.initial_rtt | Initial round-trip time for TCP connection |
tcp.connection.fin | TCP connection termination with a FIN flag |
tcp.connection.syn | TCP connection initiation with SYN flag |
tcp.connection.synack | TCP connection establishment with SYN-ACK flags |
tcp.flags.cwr | Congestion Window Reduced flag in TCP |
tcp.flags.ecn | Explicit Congestion Notification flag in TCP |
tcp.flags.fin | FIN flag in TCP |
tcp.flags.ns | Nonce Sum flag in TCP |
tcp.flags.res | Reserved flags in TCP |
tcp.flags.syn | SYN flag in TCP |
tcp.flags.urg | Urgent flag in TCP |
tcp.urgent_pointer | Pointer to urgent data in TCP |
ip.frag_offset | Fragment offset in IP packets |
eth.dst.ig | IG bit of the Ethernet destination address (individual or group address) |
eth.src.ig | IG bit of the Ethernet source address (individual or group address) |
eth.src.lg | LG bit of the Ethernet source address (locally or globally administered) |
eth.src_not_group | Ethernet source is not a group address |
arp.isannouncement | Indicates if an ARP message is an announcement |
Identified Key Features Within IP-Based Flows Dataset
Feature | Meaning |
proto | Transport layer protocol of the connection |
service | Identification of an application protocol |
orig_bytes | Originator payload bytes |
resp_bytes | Responder payload bytes |
history | Connection state history |
orig_pkts | Originator sent packets |
resp_pkts | Responder sent packets |
flow_duration | Length of the flow in seconds |
fwd_pkts_tot | Forward packets total |
bwd_pkts_tot | Backward packets total |
fwd_data_pkts_tot | Forward data packets total |
bwd_data_pkts_tot | Backward data packets total |
fwd_pkts_per_sec | Forward packets per second |
bwd_pkts_per_sec | Backward packets per second |
flow_pkts_per_sec | Flow packets per second |
fwd_header_size | Forward header bytes |
bwd_header_size | Backward header bytes |
fwd_pkts_payload | Forward payload bytes |
bwd_pkts_payload | Backward payload bytes |
flow_pkts_payload | Flow payload bytes |
fwd_iat | Forward inter-arrival time |
bwd_iat | Backward inter-arrival time |
flow_iat | Flow inter-arrival time |
active | Flow active duration |
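For instance, a minimal sketch (assuming Python with pandas and scikit-learn; the CSV path and label column are hypothetical, and this is not the evaluation model from the article) of training a classifier on some of the flow features listed above:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical path and label column; replace with the actual flow dataset and its label field.
df = pd.read_csv("datasets/ip_based_flows.csv")
features = ["flow_duration", "fwd_pkts_tot", "bwd_pkts_tot", "fwd_pkts_per_sec",
            "bwd_pkts_per_sec", "flow_pkts_per_sec", "fwd_header_size", "bwd_header_size"]
X = df[features]
y = df["label"]  # assumed column indicating benign vs. attack traffic

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))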
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
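To make the APP concrete, here is a minimal sketch (assuming Python with numpy; this is an independent illustration, not the Julia implementation referenced below) of drawing one APP sample for a data set with a given number of classes:

import numpy as np

def draw_app_sample(labels, sample_size, n_classes, rng):
    # APP: draw a class prevalence vector uniformly from the probability simplex,
    # then sample items (with replacement here, for simplicity) according to it.
    prevalences = rng.dirichlet(np.ones(n_classes))
    drawn = []
    for c, p in enumerate(prevalences):
        candidates = np.flatnonzero(labels == c)
        n_c = int(round(p * sample_size))
        if n_c > 0 and len(candidates) > 0:
            drawn.append(rng.choice(candidates, size=n_c, replace=True))
    return np.concatenate(drawn) if drawn else np.array([], dtype=int)

# Toy usage: 1,000 items with 5 ordinal classes
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=1000)
sample_indices = draw_app_sample(labels, sample_size=100, n_classes=5, rng=rng)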
Usage
You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
or k-means.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please note that the file msl-labeled-data-set-v2.1.zip below contains the latest images and labels associated with this data set.
Data Set Description
The data set consists of 6,820 images that were collected by the Mars Science Laboratory (MSL) Curiosity Rover using three instruments: (1) the Mast Camera (Mastcam) Left Eye; (2) the Mast Camera Right Eye; (3) the Mars Hand Lens Imager (MAHLI). With help from Dr. Raymond Francis, a member of the MSL operations team, we identified 19 classes of science and engineering interest (see the "Classes" section for more information), and each image is assigned one class label. We split the data set into training, validation, and test sets in order to train and evaluate machine learning algorithms. The training set contains 5,920 images (including augmented images; see the "Image Augmentation" section for more information); the validation set contains 300 images; the test set contains 600 images. The training set images were randomly sampled from sol (Martian day) range 1 - 948; validation set images were randomly sampled from sol range 949 - 1920; test set images were randomly sampled from sol range 1921 - 2224. All images are resized to 227 x 227 pixels without preserving the original height/width aspect ratio.
Directory Contents
The label files are formatted as below:
"Image-file-name class_in_integer_representation"
Labeling Process
Each image was labeled with help from three different volunteers (see Contributor list). The final labels are determined using the following processes:
Classes
There are 19 classes identified in this data set. In order to simplify our training and evaluation algorithms, we mapped the class names from string to integer representations. The names of classes, string-integer mappings, distributions are shown below:
Class name, counts (training set), counts (validation set), counts (test set), integer representation
Arm cover, 10, 1, 4, 0
Other rover part, 190, 11, 10, 1
Artifact, 680, 62, 132, 2
Nearby surface, 1554, 74, 187, 3
Close-up rock, 1422, 50, 84, 4
DRT, 8, 4, 6, 5
DRT spot, 214, 1, 7, 6
Distant landscape, 342, 14, 34, 7
Drill hole, 252, 5, 12, 8
Night sky, 40, 3, 4, 9
Float, 190, 5, 1, 10
Layers, 182, 21, 17, 11
Light-toned veins, 42, 4, 27, 12
Mastcam cal target, 122, 12, 29, 13
Sand, 228, 19, 16, 14
Sun, 182, 5, 19, 15
Wheel, 212, 5, 5, 16
Wheel joint, 62, 1, 5, 17
Wheel tracks, 26, 3, 1, 18
Image Augmentation
Only the training set contains augmented images. 3,920 of the 5,920 images in the training set are augmented versions of the remaining 2,000 original training images. Images taken by different instruments were augmented differently. As shown below, we employed 5 different methods to augment images. Images taken by the Mastcam left and right eye cameras were augmented using a horizontal flipping method, and images taken by the MAHLI camera were augmented using all 5 methods. Note that one can filter based on the file names listed in the train-set.txt file to obtain a set of non-augmented images.
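As one example of these augmentations, a minimal sketch (assuming Python with Pillow; this is not the authors' exact pipeline, and the file names are placeholders) of the horizontal flipping method applied to Mastcam images:

from PIL import Image, ImageOps

# Horizontal flip, the augmentation method named above for Mastcam images.
img = Image.open("example.jpg")        # placeholder input image
flipped = ImageOps.mirror(img)         # flip left-right
flipped.save("example_flipped.jpg")    # placeholder output name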
Acknowledgment
The authors would like to thank the volunteers (listed as Contributors) who provided annotations for this data set. We would also like to thank the PDS Imaging Node for its continuous support of this work.
To develop a simulation that collects both visual information and grasp information about different objects using a multi-fingered hand. These sources of data can be used in the future to learn integrated object-action grasp representations.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The BoolQ dataset is a valuable resource crafted for question answering tasks. It is organised into two main splits: a validation split and a training split. The primary aim of this dataset is to facilitate research in natural language processing (NLP) and machine learning (ML), particularly in tasks involving the answering of questions based on provided text. It offers a rich collection of user-posed questions, their corresponding answers, and the passages from which these answers are derived. This enables researchers to develop and evaluate models for real-world scenarios where information needs to be retrieved or understood from textual sources.
The answer column contains boolean values, with true appearing 5,874 times (62%) and false appearing 3,553 times (38%). The BoolQ dataset consists of two main parts: a validation split and a training split. Both splits feature consistent data fields: question, answer, and passage. The train.csv file, for example, is part of the training data. While specific row or record counts are not detailed for the entire dataset, the answer column contains 9,427 boolean values in total.
This dataset is ideally suited for: * Question Answering Systems: Training models to identify correct answers from multiple choices, given a question and a passage. * Machine Reading Comprehension: Developing models that can understand and interpret written text effectively. * Information Retrieval: Enabling models to retrieve relevant passages or documents that contain answers to a given query or question.
The sources do not specify the geographic, time range, or demographic scope of the data.
CC0
The BoolQ dataset is primarily intended for researchers and developers working in artificial intelligence fields such as Natural Language Processing (NLP) and Machine Learning (ML). It is particularly useful for those building or evaluating: * Question answering algorithms * Information retrieval systems * Machine reading comprehension models
Original Data Source: BoolQ - Question-Answer-Passage Consistency
This dataset contains a set of data files used as input for a World Bank research project (an empirical comparative assessment of machine learning algorithms applied to poverty prediction). The objective of the project was to compare the performance of a series of classification algorithms. The dataset contains variables at the household, individual, and community levels. The variables selected to serve as potential predictors in the machine learning models are all qualitative variables (except for the household size). Information on household consumption is included, but in the form of dummy variables (indicating whether or not the household consumed each specific product or service listed in the survey questionnaire). The household-level data file contains the variable "Poor / Non-poor", which served as the predicted variable ("label") in the models.
One of the data files included in the dataset contains data on household consumption (amounts) by main categories of products and services. This data file was not used in the prediction model; it is used only for the purpose of analyzing the models' mis-classifications (in particular, to identify how far the mis-classified households are from the national poverty line).
These datasets are provided to allow interested users to replicate the analysis done for the project using Python 3 (a collection of Jupyter Notebooks containing the documented scripts is openly available on GitHub).
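As an illustration of the kind of comparison the project performed, a minimal sketch (assuming Python 3 with pandas and scikit-learn; the file name, label column, and algorithms shown are placeholders, not the project's actual scripts):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical household-level file with dummy-coded predictors and a "poor" label column.
df = pd.read_csv("household_level.csv")
y = df["poor"]                   # assumed 1 = poor, 0 = non-poor
X = df.drop(columns=["poor"])    # qualitative predictors (dummies) plus household size

# Compare a few classification algorithms with cross-validated accuracy.
for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(n_estimators=200))]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())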
National
Sample survey data [ssd]
The IHS3 sampling frame is based on the listing information and cartography from the 2008 Malawi Population and Housing Census (PHC); includes the three major regions of Malawi, namely North, Center and South; and is stratified into rural and urban strata. The urban strata include the four major urban areas: Lilongwe City, Blantyre City, Mzuzu City, and the Municipality of Zomba. All other areas are considered as rural areas, and each of the 27 districts were considered as a separate sub-stratum as part of the main rural stratum. It was decided to exclude the island district of Likoma from the IHS3 sampling frame, since it only represents about 0.1% of the population of Malawi, and the corresponding cost of enumeration would be relatively high. The sampling frame further excludes the population living in institutions, such as hospitals, prisons and military barracks. Hence, the IHS3 strata are composed of 31 districts in Malawi.
A stratified two-stage sample design was used for the IHS3.
Face-to-face [f2f]
The survey was collected using four questionnaires: 1) Household Questionnaire; 2) Agriculture Questionnaire; 3) Fishery Questionnaire; 4) Community Questionnaire.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The image files were scaled and modified to represent a training data set. They can be used to detect and identify object type based on the material type in the image. In this process, both a training data set and a test data set can be generated from these image files.