Modeling data and analysis scripts generated during the current study are available in the GitHub repository: https://github.com/USEPA/CompTox-MIEML. RefChemDB is available for download as supplemental material from its original publication (PMID: 30570668). LINCS gene expression data are publicly available through the Gene Expression Omnibus (GSE92742 and GSE70138) at https://www.ncbi.nlm.nih.gov/geo/. This dataset is associated with the following publication: Bundy, J., R. Judson, A. Williams, C. Grulke, I. Shah, and L. Everett. Predicting Molecular Initiating Events Using Chemical Target Annotations and Gene Expression. BioData Mining. BioMed Central Ltd, London, UK, 15: 7 (2022).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RCV1
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:
Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.
Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.
Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.
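As an illustration, a shuffled hold-out split can be sketched in a few lines of pure Python (the function name is ours; in practice a library routine such as scikit-learn's `train_test_split` is the usual choice):

```python
import random

def shuffled_split(examples, test_fraction=0.2, seed=42):
    """Shuffle the dataset, then hold out the last test_fraction as test data."""
    items = list(examples)
    random.Random(seed).shuffle(items)           # fixed seed for reproducibility
    n_test = max(1, int(len(items) * test_fraction))
    return items[:-n_test], items[-n_test:]      # (train, test)

train, test = shuffled_split(range(100), test_fraction=0.2)
```

Shuffling before splitting matters: if the data are ordered (e.g. by class), a naive tail split would give a test set that does not represent the training distribution.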
Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).
Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of actual positives that are correctly identified), and F1 score (the harmonic mean of precision and recall).
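These four metrics follow directly from the counts of true/false positives and negatives; a minimal pure-Python sketch for a binary task (the helper name is ours):

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classification task."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```

For example, with three true positives in the data and predictions that catch two of them while raising one false alarm, precision and recall both come out to 2/3.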
Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.
Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.
Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.
Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.
Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.
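As a toy illustration of clustering, here is a minimal one-dimensional k-means (the function and data are hypothetical; real work would use a library implementation operating on multi-dimensional features):

```python
import random

def kmeans_1d(values, k=2, iters=20, seed=0):
    """Toy k-means for one-dimensional data; returns the sorted cluster centres."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # initialise centres from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:  # assignment step: each point goes to its nearest centre
            clusters[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        # update step: move each centre to the mean of its assigned points
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return sorted(centers)

centers = kmeans_1d([1.0, 1.1, 0.9, 10.0, 10.2, 9.8])  # two obvious groups
```

The alternation of assignment and update steps is the core of k-means; on this toy data the centres settle at the means of the two groups.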
These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SANAD Dataset is a large collection of Arabic news articles that can be used in different Arabic NLP tasks such as Text Classification and Word Embedding. The articles were collected using Python scripts written specifically for three popular news websites: AlKhaleej, AlArabiya and Akhbarona.
All datasets have seven categories [Culture, Finance, Medical, Politics, Religion, Sports and Tech], except AlArabiya which doesn't have [Religion]. SANAD contains a total number of 190k+ articles.
SANAD_SUBSET is a balanced benchmark dataset (from SANAD) that is used in our research work. It contains the training (90%) and testing (10%) sets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The multi-label code-smell dataset for studies related to multi-label classification
The open dataset, software, and other files accompanying the manuscript "An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models," submitted for publication to Integrated Materials and Manufacturing Innovations. Machine learning and autonomy are increasingly prevalent in materials science, but existing models are often trained or tuned using idealized data as absolute ground truths. In actual materials science, "ground truth" is often a matter of interpretation and is more readily determined by consensus. Here we present the data, software, and other files for a study using as-obtained diffraction data as a test case for evaluating the performance of machine learning models in the presence of differing expert opinions. We demonstrate that experts with similar backgrounds can disagree greatly even for something as intuitive as using diffraction to identify the start and end of a phase transformation. We then use a logarithmic likelihood method to evaluate the performance of machine learning models in relation to the consensus expert labels and their variance. We further illustrate this method's efficacy in ranking a number of state-of-the-art phase mapping algorithms. We propose a materials data challenge centered around the problem of evaluating models based on consensus with uncertainty. The data, labels, and code used in this study are all available online at data.gov, and the interested reader is encouraged to replicate and improve the existing models or to propose alternative methods for evaluating algorithmic performance.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Color Label is a dataset for object detection tasks - it contains Color Bottle annotations for 2,258 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multi-Label Web Page Classification Dataset
Dataset Description
The Multi-Label Web Page Classification Dataset is a curated dataset containing web page titles and snippets, extracted from the CC-Meta25-1M dataset. Each entry has been automatically categorized into multiple predefined categories using ChatGPT-4o-mini. This dataset is designed for multi-label text classification tasks, making it ideal for training and evaluating machine learning models in web content… See the full description on the dataset page: https://huggingface.co/datasets/tshasan/multi-label-web-categorization.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
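A minimal sketch of how such an index file could be read in order to replicate a sample (the helper name is ours; the layout of one sample per CSV row follows the description above):

```python
import csv
from io import StringIO

def load_sample_indices(fileobj):
    """Each CSV row lists the data-item indices that make up one evaluation sample."""
    return [[int(i) for i in row] for row in csv.reader(fileobj)]

# hypothetical usage: samples = load_sample_indices(open("app_val_indices.csv"))
demo = load_sample_indices(StringIO("0,5,7\n1,2,9\n"))  # two samples of three items
```

Each returned list can then be used to draw exactly the specified data items from the extracted CSV files.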
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
Usage
You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You need a working Julia installation. We used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
These images and associated binary labels were collected from collaborators across multiple universities to serve as a diverse representation of biomedical images of vessel structures, for use in the training and validation of machine learning tools for vessel segmentation. The dataset contains images from a variety of imaging modalities, at different resolutions, using different sources of contrast and featuring different organs/pathologies. These data were used to train, test and validate a foundational model for 3D vessel segmentation, tUbeNet, which can be found on GitHub. The paper describing the training and validation of the model can be found here. Filenames are structured as follows: Data - [Modality]_[species Organ]_[resolution].tif; Labels - [Modality]_[species Organ]_[resolution]_labels.tif; Sub-volumes of larger datasets - [Modality]_[species Organ]_subvolume[dimensions in pixels].tif. Manual labelling of blood vessels was carried out using Amira (2020.2, Thermo-Fisher, UK). Training data: opticalHREM_murineLiver_2.26x2.26x1.75um.tif: a high resolution episcopic microscopy (HREM) dataset, acquired in house by staining a healthy mouse liver with Eosin B and imaging using a standard HREM protocol. NB: 25% of this image volume was withheld from training, for use as test data. CT_murineTumour_20x20x20um.tif: X-ray microCT images of a microvascular cast, taken from a subcutaneous mouse model of colorectal cancer (acquired in house). NB: 25% of this image volume was withheld from training, for use as test data. RSOM_murineTumour_20x20um.tif: Raster-Scanning Optoacoustic Mesoscopy (RSOM) data from a subcutaneous tumour model (provided by Emma Brown, Bohndiek Group, University of Cambridge). The image data have undergone filtering to reduce the background (Brown et al., 2019). OCTA_humanRetina_24x24um.tif: retinal angiography data obtained using Optical Coherence Tomography Angiography (OCT-A) (provided by Dr Ranjan Rajendram, Moorfields Eye Hospital).
Test data: MRI_porcineLiver_0.9x0.9x5mm.tif: T1-weighted Balanced Turbo Field Echo Magnetic Resonance Imaging (MRI) data from a machine-perfused porcine liver, acquired in house. MFHREM_murineTumourLectin_2.76x2.76x2.61um.tif: a subcutaneous colorectal tumour mouse model imaged in house using Multi-fluorescence HREM (MF-HREM), with Dylight 647 conjugated lectin staining the vasculature (Walsh et al., 2021). The image data have been processed using an asymmetric deconvolution algorithm described by Walsh et al., 2020. NB: a sub-volume of 480x480x640 voxels was manually labelled (MFHREM_murineTumourLectin_subvolume480x480x640.tif). MFHREM_murineBrainLectin_0.85x0.85x0.86um.tif: an MF-HREM image of the cortex of a mouse brain, stained with Dylight-647 conjugated lectin, acquired in house (Walsh et al., 2021). The image data have been downsampled and processed using an asymmetric deconvolution algorithm described by Walsh et al., 2020. NB: a sub-volume of 1000x1000x99 voxels was manually labelled. This sub-volume is provided at full resolution and without preprocessing (MFHREM_murineBrainLectin_subvol_0.57x0.57x0.86um.tif). 2Photon_murineOlfactoryBulbLectin_0.2x0.46x5.2um.tif: two-photon data of mouse olfactory bulb blood vessels, labelled with sulforhodamine 101, kindly provided by Yuxin Zhang at the Sensory Circuits and Neurotechnology Lab, the Francis Crick Institute (Bosch et al., 2022). NB: a sub-volume of 500x500x79 voxels was manually labelled (2Photon_murineOlfactoryBulbLectin_subvolume500x500x79.tif).
References: Bosch, C., Ackels, T., Pacureanu, A., Zhang, Y., Peddie, C. J., Berning, M., Rzepka, N., Zdora, M. C., Whiteley, I., Storm, M., Bonnin, A., Rau, C., Margrie, T., Collinson, L., & Schaefer, A. T. (2022). Functional and multiscale 3D structural investigation of brain tissue through correlative in vivo physiology, synchrotron microtomography and volume electron microscopy. Nature Communications, 13(1), 1-16. https://doi.org/10.1038/s41467-022-30199-6 Brown, E., Brunker, J., & Bohndiek, S. E. (2019). Photoacoustic imaging as a tool to probe the tumour microenvironment. DMM Disease Models and Mechanisms, 12(7). https://doi.org/10.1242/DMM.039636 Walsh, C., Holroyd, N. A., Finnerty, E., Ryan, S. G., Sweeney, P. W., Shipley, R. J., & Walker-Samuel, S. (2021). Multifluorescence High-Resolution Episcopic Microscopy for 3D Imaging of Adult Murine Organs. Advanced Photonics Research, 2(10), 2100110. https://doi.org/10.1002/ADPR.202100110 Walsh, C., Holroyd, N., Shipley, R., & Walker-Samuel, S. (2020). Asymmetric Point Spread Function Estimation and Deconvolution for Serial-Sectioning Block-Face Imaging. Communications in Computer and Information Science, 1248 CCIS, 235-249. https://doi.org/10.1007/978-3-030-52791-4_19
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MAAD dataset is a comprehensive collection of Arabic news articles that may be employed across a diverse array of Arabic Natural Language Processing (NLP) applications, including but not limited to classification, text generation, summarization, and various other tasks. The dataset was assembled using specifically designed Python scripts that targeted six prominent news platforms: Al Jazeera, BBC Arabic, Youm7, Russia Today, and Al Ummah, in conjunction with regional and local media outlets, ultimately resulting in a total of 602,792 articles. The dataset has a total word count of 29,371,439, with 296,518 unique words; the average word length is 6.36 characters, while the mean article length is 736.09 characters. This extensive dataset is categorized into ten distinct classifications: Political, Economic, Cultural, Arts, Sports, Health, Technology, Community, Incidents, and Local. The data fields are of five distinct types: Title, Article, Summary, Category, and Published_Date. The MAAD dataset is structured into six files, each named after the corresponding news outlet from which the data was sourced; within each directory, text files are provided, one per category, formatted in txt to accommodate all news articles. This dataset serves as an expansive standard resource designed for utilization within the context of our research endeavors.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Last update: February 2021. The dataset folder includes both raw and post-processed radar data used for training and testing the networks proposed in Sect. VIII of the article "Machine Learning and Deep Learning Techniques for Colocated MIMO Radars: A Tutorial Overview".

The folder Human Activity Classification contains:
- "Raw" folder: 150 files acquired with our FMCW radar sensor, provided inside the "doppler_dataset" zip folder; they are divided into 50 for walking, 50 for jumping and 50 for running.
- "Post_process" folder, divided into:
  - "Machine Learning" folder including "dataset_ML_doppler_real_activities.mat"; this dataset has been used for training and testing the SVM, K-NN and Adaboost classifiers described in Sect. VIII-A). It contains the 150x4 matrix "X_meas" with the features described by eqs. (227)-(234) and the 150x1 vector of chars "labels_py" containing the associated labels.
  - "Deep Learning" folder containing "dataset_DL_doppler_real_activities.mat"; this dataset is composed of 150 structs, each associated with a specific activity and including:
    - the label associated with the considered activity;
    - the overall range variation from the beginning to the end of the motion ("delta_R");
    - the Range-Doppler map ("RD_map");
    - the normalized spectrogram ("SP_Norm");
    - the Cadence Velocity Diagram ("CVD");
    - the period of the spectrogram ("per");
    - the peaks associated with the three greatest cadence frequencies ("peaks_cad");
    - the three strongest cadence frequencies and their normalized version ("cad_freqs" and "cad_freqs_norm");
    - the strongest cadence frequency ("c1");
    - the three velocity profiles associated with the three strongest cadence frequencies ("matr_vex").

The spectrogram images (SP_Norm) contained in this dataset were used for training and testing the CNN in Sect. VIII-A).

The folder Obstacle Detection contains:
- "Raw" folder: raw data acquired with our radar system and TOF camera in multi-target or single-target scenarios, provided inside the "obst_detect_Raw_mat" zip folder. It's important to note that each radar frame and each TOF camera image has its own time stamp, but since they come from different sensors they have to be synchronized.
- "Post_process" folder, divided into:
  - "Neural Net" folder containing "inputs_bis_1.mat" and "t_d_1.mat", where "inputs_bis_1.mat" contains the 32x1 feature vectors used for training and testing the feed-forward neural network described in Sect. VIII-B) (see eqs. (243)-(251)), and "t_d_1.mat" contains the associated 2x1 label vectors (see eq. (235)).
  - "Yolo v2" folder containing the folder "Dataset_YOLO" and the table "obj_dataset_tab", where "Dataset_YOLO_v2" contains (inside the sub-folder "obj_dataset") the Range-Azimuth maps used for training the YOLO v2 network (see eqs. (257)-(258) and Fig. 30), and "obj_dataset_tab" contains the path, the bounding box and the label associated with the Range-Azimuth maps (see eqs. (256)-(266)).

Cite as: A. Davoli, G. Guerzoni and G. M. Vitetta, "Machine Learning and Deep Learning Techniques for Colocated MIMO Radars: A Tutorial Overview," IEEE Access, vol. 9, pp. 33704-33755, 2021, doi: 10.1109/ACCESS.2021.3061424.
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class).
Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection consists of six multi-label datasets from the UCI Machine Learning Repository.
Each dataset contains missing values which have been artificially added at the following rates: 5, 10, 15, 20, 25, and 30%. The "amputation" was performed using the "Missing Completely at Random" mechanism.
File names are represented as follows:
amp_DB_MR.arff
where:
DB = original dataset;
MR = missing rate.
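A small helper along these lines could recover the original dataset name and missing rate from a filename (the helper and the example dataset name "emotions" are ours, for illustration):

```python
import re

def parse_amputed_name(filename):
    """Recover (original dataset, missing rate in %) from an amp_DB_MR.arff name."""
    m = re.match(r"amp_(?P<db>.+)_(?P<mr>\d+)\.arff$", filename)
    if m is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    return m.group("db"), int(m.group("mr"))
```

The greedy `.+` lets dataset names that themselves contain underscores still parse, since the missing rate is anchored as the final digit group.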
For more details, please read:
IEEE Access article (in review process)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MLRSNet provides different perspectives of the world captured from satellites. That is, it is composed of high spatial resolution optical satellite images. MLRSNet contains 109,161 remote sensing images that are annotated into 46 categories, and the number of sample images in a category varies from 1,500 to 3,000. The images have a fixed size of 256×256 pixels with various pixel resolutions (~10m to 0.1m). Moreover, each image in the dataset is tagged with several of 60 predefined class labels, and the number of labels associated with each image varies from 1 to 13. The dataset can be used for multi-label based image classification, multi-label based image retrieval, and image segmentation.
The Dataset includes: 1. Images folder: 46 categories, 109,161 high-spatial resolution remote sensing images. 2. Labels folder: each category has a .csv file. 3. Categories_names.xlsx: Sheet1 lists the names of the 46 categories, and Sheet2 shows the multi-labels associated with each category.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The RTAnews dataset is a collection of multi-label Arabic texts, collected from the Russia Today in Arabic news portal. It consists of 23,837 texts (news articles) distributed over 40 categories, divided into 15,001 texts for training and 8,836 texts for testing.
The original dataset (without preprocessing), a preprocessed version, versions in MEKA and Mulan formats, a single-label version, and a WEAK version are all available.
For any enquiry or support regarding the dataset, please feel free to contact us via bassalemi at gmail dot com
The ArXiv CS Papers Multi-Label Classification dataset is a comprehensive collection of research papers from the computer science domain. This dataset is intended for multi-label classification tasks and contains a diverse range of research papers spanning various topics within computer science.
The dataset consists of approximately 200,000+ research papers and includes the following columns:
- Paper ID: a unique identifier for each research paper in the dataset.
- Title: the title of the research paper.
- Abstract: a brief summary or abstract of the research paper.
- Year: the publication year of the research paper.
- Primary Category: the primary category of the research paper, representing the main topic or area of focus.
- Categories: additional categories or subtopics associated with the research paper.

This dataset is well-suited for tasks related to text classification, topic modeling, information retrieval, and other natural language processing (NLP) tasks. Researchers and practitioners can leverage this dataset to develop and evaluate machine learning models for multi-label classification on a wide range of computer science topics.
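For multi-label classification, the Categories column would typically be converted into a multi-hot label matrix; a minimal pure-Python sketch (the helper name and the category values shown are illustrative):

```python
def multi_hot(categories_per_paper):
    """Turn per-paper category lists into a sorted vocabulary and a 0/1 label matrix."""
    vocab = sorted({c for cats in categories_per_paper for c in cats})
    index = {c: i for i, c in enumerate(vocab)}
    rows = []
    for cats in categories_per_paper:
        row = [0] * len(vocab)
        for c in cats:
            row[index[c]] = 1  # mark every category the paper belongs to
        rows.append(row)
    return vocab, rows
```

Each row then serves as the multi-label target vector for one paper, with one column per known category.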
Note: Please refer to the original ArXiv repository for access to the full-text content of the papers and proper citation guidelines. This dataset contains metadata and should be used for research and educational purposes only.
We hope that the ArXiv CS Papers Multi-Label Classification dataset serves as a valuable resource for researchers, data scientists, and machine learning enthusiasts in their quest to advance knowledge and understanding in the field of computer science.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The BUTTER Empirical Deep Learning Dataset represents an empirical study of the deep learning phenomena on dense fully connected networks, scanning across thirteen datasets, eight network shapes, fourteen depths, twenty-three network sizes (number of trainable parameters), four learning rates, six minibatch sizes, four levels of label noise, and fourteen levels of L1 and L2 regularization each. Multiple repetitions (typically 30, sometimes 10) of each combination of hyperparameters were performed, and statistics including training and test loss (using an 80% / 20% shuffled train-test split) are recorded at the end of each training epoch. In total, this dataset covers 178 thousand distinct hyperparameter settings ("experiments"), 3.55 million individual training runs (an average of 20 repetitions of each experiment), and a total of 13.3 billion training epochs (three thousand epochs were covered by most runs). Accumulating this dataset consumed 5,448.4 CPU core-years, 17.8 GPU-years, and 111.2 node-years.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This comprehensive pharmaceutical synthetic dataset contains 1,393 records of synthetic drug information with 15 columns, designed for data science projects focusing on healthcare analytics, drug safety analysis, and pharmaceutical research. The dataset simulates real-world pharmaceutical data with appropriate variety and realistic constraints for machine learning applications.
| Attribute | Value |
|---|---|
| Total Records | 1,393 |
| Total Columns | 15 |
| File Format | CSV |
| Data Types | Mixed (intentional for data cleaning practice) |
| Domain | Pharmaceutical/Healthcare |
| Use Case | ML Training, Data Analysis, Healthcare Research |
| Column Name | Data Type | Unique Values | Description | Example Values |
|---|---|---|---|---|
| drug_name | Object | 1,283 unique | Pharmaceutical drug names with realistic naming patterns | "Loxozepam32", "Amoxparin43", "Virazepam10" |
| manufacturer | Object | 10 unique | Major pharmaceutical companies | Pfizer Inc., AstraZeneca, Johnson & Johnson |
| drug_class | Object | 10 unique | Therapeutic drug classifications | Antibiotic, Analgesic, Antidepressant, Vaccine |
| indications | Object | 10 unique | Medical conditions the drug treats | "Pain relief", "Bacterial infections", "Depression treatment" |
| side_effects | Object | 434 unique | Combination of side effects (1-3 per drug) | "Nausea, Dizziness", "Headache, Fatigue, Rash" |
| administration_route | Object | 7 unique | Method of drug delivery | Oral, Intravenous, Topical, Inhalation, Sublingual |
| contraindications | Object | 10 unique | Medical warnings for drug usage | "Pregnancy", "Heart disease", "Liver disease" |
| warnings | Object | 10 unique | Safety instructions and precautions | "Take with food", "Avoid alcohol", "Monitor blood pressure" |
| batch_number | Object | 1,393 unique | Manufacturing batch identifiers | "xr691zv", "Ye266vU", "Rm082yX" |
| expiry_date | Object | 782 unique | Drug expiration dates (YYYY-MM-DD) | "2025-12-13", "2027-03-09", "2026-10-06" |
| side_effect_severity | Object | 3 unique | Severity classification | Mild, Moderate, Severe |
| approval_status | Object | 3 unique | Regulatory approval status | Approved, Pending, Rejected |
| Column Name | Data Type | Range | Mean | Std Dev | Description |
|---|---|---|---|---|---|
approval_year | Float/String* | 1990-2024 | 2006.7 | 10.0 | FDA/regulatory approval year |
dosage_mg | Float/String* | 10-990 mg | 499.7 | 290.0 | Medication strength in milligrams |
price_usd | Float/String* | $2.32-$499.24 | $251.12 | $144.81 | Drug price in US dollars |
*Intentionally stored as mixed types for data cleaning practice
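Since these columns are intentionally stored as mixed types, a first cleaning step is to coerce them to numbers. A pure-Python sketch of the idea (pandas users would typically reach for `pd.to_numeric(..., errors="coerce")` instead):

```python
def coerce_numeric(values):
    """Convert a mixed str/float column to floats; unparseable entries become None."""
    out = []
    for v in values:
        try:
            out.append(float(v))
        except (TypeError, ValueError):
            out.append(None)  # flag entries that need manual inspection
    return out
```

Entries mapped to None can then be counted, inspected, and imputed or dropped as the analysis requires.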
| Manufacturer | Count | Percentage |
|---|---|---|
| Pfizer Inc. | 170 | 12.2% |
| AstraZeneca | ~140 | ~10.0% |
| Merck & Co. | ~140 | ~10.0% |
| Johnson & Johnson | ~140 | ~10.0% |
| GlaxoSmithKline | ~140 | ~10.0% |
| Others | ~623 | ~44.8% |
| Drug Class | Count | Most Common |
|---|---|---|
| Anti-inflammatory | 154 | ✓ |
| Antibiotic | ~140 | |
| Antidepressant | ~140 | |
| Antiviral | ~140 | |
| Vaccine | ~140 | |
| Others | ~679 | |
| Severity | Count | Percentage |
|---|---|---|
| Severe | 488 | 35.0% |
| Moderate | ~453 | ~32.5% |
| Mild | ~452 | ~32.5% |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this work comprises data from four participants, two men and two women. Each of them carried the wearable device Empatica E4 for a total of 15 days. They wore the device during the day, and during the nights we asked participants to charge it and load the data into an external memory unit. During these days, participants were asked to answer EMA questionnaires, which are used to label our data. However, some participants could not complete the full experiment, and some days were discarded due to data corruption. Specific demographic information, total sampling days and total number of EMA answers can be found in Table I.
| | Participant 1 | Participant 2 | Participant 3 | Participant 4 |
|---|---|---|---|---|
| Age | 67 | 55 | 60 | 63 |
| Gender | Male | Female | Male | Female |
| Final Valid Days | 9 | 15 | 12 | 13 |
| Total EMAs | 42 | 57 | 64 | 46 |
Table I. Summary of participants' collected data.
This dataset provides three different types of labels. Activeness and happiness are two of them: the answers to the EMA questionnaires that participants reported during their daily activities, as numbers between 0 and 4.
These labels are used to interpolate the mental well-being state according to [1]. We report in our dataset a total of eight emotional states: (1) pleasure, (2) excitement, (3) arousal, (4) distress, (5) misery, (6) depression, (7) sleepiness, and (8) contentment.
The data we provide in this repository consist of two types of files:
NOTE: Files are numbered according to each specific sampling day. For example, ACC1.csv corresponds to the signal ACC for sampling day 1. The same applies to the Excel files.
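A small helper could split such a filename back into signal name and sampling day (the helper is ours; the naming pattern follows the note above):

```python
import re

def parse_day_file(name):
    """Split a per-day filename such as 'ACC1.csv' into (signal, sampling day)."""
    m = re.match(r"([A-Za-z_]+)(\d+)\.csv$", name)
    if m is None:
        raise ValueError(f"unexpected filename: {name!r}")
    return m.group(1), int(m.group(2))
```

This makes it straightforward to group the per-day CSV files by signal or by day when iterating over a participant's folder.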
Code and a tutorial on how to label the data and extract features can be found in this repository: https://github.com/edugm94/temporal-feat-emotion-prediction
References:
[1] J. A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, vol. 39, no. 6, p. 1161, 1980.