100+ datasets found
  1. Code for Predicting MIEs from Gene Expression and Chemical Target Labels...

    • catalog.data.gov
    • datasets.ai
    • +1 more
    Updated Apr 21, 2022
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). Code for Predicting MIEs from Gene Expression and Chemical Target Labels with Machine Learning (MIEML) [Dataset]. https://catalog.data.gov/dataset/code-for-predicting-mies-from-gene-expression-and-chemical-target-labels-with-machine-lear
    Explore at:
    Dataset updated
    Apr 21, 2022
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Modeling data and analysis scripts generated during the current study are available in the GitHub repository: https://github.com/USEPA/CompTox-MIEML. RefChemDB is available for download as supplemental material from its original publication (PMID: 30570668). LINCS gene expression data are publicly available and accessible through the Gene Expression Omnibus (GSE92742 and GSE70138) at https://www.ncbi.nlm.nih.gov/geo/. This dataset is associated with the following publication: Bundy, J., R. Judson, A. Williams, C. Grulke, I. Shah, and L. Everett. Predicting Molecular Initiating Events Using Chemical Target Annotations and Gene Expression. BioData Mining. BioMed Central Ltd, London, UK, 7, (2022).

  2. A collection of nine multi-label text classification datasets

    • ieee-dataport.org
    Updated Nov 4, 2024
    Cite
    Yiming Wang (2024). A collection of nine multi-label text classification datasets [Dataset]. https://ieee-dataport.org/documents/collection-nine-multi-label-text-classification-datasets
    Explore at:
    Dataset updated
    Nov 4, 2024
    Authors
    Yiming Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    RCV1

  3. Machine Learning Basics for Beginners🤖🧠

    • kaggle.com
    zip
    Updated Jun 22, 2023
    Cite
    Bhanupratap Biswas (2023). Machine Learning Basics for Beginners🤖🧠 [Dataset]. https://www.kaggle.com/datasets/bhanupratapbiswas/machine-learning-basics-for-beginners
    Explore at:
    zip (492015 bytes)
    Available download formats
    Dataset updated
    Jun 22, 2023
    Authors
    Bhanupratap Biswas
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:

    1. Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.

    2. Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.

    3. Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.
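    The split described above can be sketched with scikit-learn. This is a hedged illustration only: the arrays `X` and `y` and the 70/30 ratio are arbitrary placeholders, not taken from any dataset listed here.

```python
# Hypothetical train/test split with scikit-learn; X and y are toy data.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 examples, 2 features
y = np.array([0, 1] * 5)           # binary labels

# Hold out 30% of the data for testing; stratify keeps the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
print(len(X_train), len(X_test))   # 7 3
```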

    4. Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).

    5. Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of actual positives that are correctly predicted), and F1 score (the harmonic mean of precision and recall).
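    The four metrics above can be computed directly with scikit-learn; the label vectors below are made up purely for illustration.

```python
# Toy example of the common evaluation metrics, using scikit-learn.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))   # 0.75 (6 of 8 correct)
print(precision_score(y_true, y_pred))  # 0.75 (3 of 4 predicted positives are true)
print(recall_score(y_true, y_pred))     # 0.75 (3 of 4 actual positives found)
print(f1_score(y_true, y_pred))         # 0.75 (harmonic mean of the two)
```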

    6. Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.

    7. Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.

    8. Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.

    9. Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.
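    As a minimal sketch of one supervised algorithm named above, the snippet below fits a logistic regression on scikit-learn's bundled iris data; the dataset choice and hyperparameters are illustrative assumptions.

```python
# Fit logistic regression (a supervised learning algorithm) on toy data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 2))  # held-out accuracy
```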

    10. Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.
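    Two of the unsupervised techniques just mentioned, PCA and k-means, can be combined in a few lines; the synthetic blob data below is an assumed stand-in for a real dataset.

```python
# Dimensionality reduction (PCA) followed by clustering (k-means) on synthetic data.
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=150, centers=3, n_features=5, random_state=0)
X2 = PCA(n_components=2).fit_transform(X)       # project 5 features down to 2
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2)
print(sorted(set(labels)))  # [0, 1, 2]
```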

    These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.

  4. Data from: SANAD: Single-Label Arabic News Articles Dataset for Automatic...

    • data.mendeley.com
    Updated Sep 2, 2019
    + more versions
    Cite
    Omar Einea (2019). SANAD: Single-Label Arabic News Articles Dataset for Automatic Text Categorization [Dataset]. http://doi.org/10.17632/57zpx667y9.2
    Explore at:
    Dataset updated
    Sep 2, 2019
    Authors
    Omar Einea
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SANAD Dataset is a large collection of Arabic news articles that can be used in different Arabic NLP tasks such as Text Classification and Word Embedding. The articles were collected using Python scripts written specifically for three popular news websites: AlKhaleej, AlArabiya and Akhbarona.

    All datasets have seven categories [Culture, Finance, Medical, Politics, Religion, Sports and Tech], except AlArabiya which doesn’t have [Religion]. SANAD contains a total number of 190k+ articles.

    How to use it:

    1. Unzip compressed resources.
    2. Each folder contains 6-7 sub-folders which are labeled by the category's name.
    3. Each sub-folder contains a set of article files corresponding to its category.

    SANAD_SUBSET is a balanced benchmark dataset (from SANAD) that is used in our research work. It contains the training (90%) and testing (10%) sets.

    How to use it:

    1. Unzip the compressed file.
    2. There are 3 main folders containing the 3 datasets: Akhbarona, Khaleej, and Arabiya.
    3. Each dataset-folder contains 2 sub-folders: training and testing.
    4. The training and testing folders include the balanced categories sub-folders.
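    The category-labelled folder layout described in the steps above can be walked with a few lines of Python. This is a hypothetical sketch, not code from the SANAD authors; the root path and the helper name `load_articles` are placeholders.

```python
# Sketch: iterate over <root>/<category>/<file>.txt, yielding (category, text).
import os

def load_articles(root):
    """Yield (category, article text) pairs from a SANAD-style folder tree."""
    for category in sorted(os.listdir(root)):
        cat_dir = os.path.join(root, category)
        if not os.path.isdir(cat_dir):
            continue  # skip stray files at the top level
        for fname in sorted(os.listdir(cat_dir)):
            with open(os.path.join(cat_dir, fname), encoding="utf-8") as f:
                yield category, f.read()
```

    The same loop works for the SANAD_SUBSET layout by pointing `root` at a training or testing folder.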
  5. Multi-label code-smell dataset

    • figshare.com
    txt
    Updated Aug 24, 2023
    Cite
    Binh Nguyen Thanh (2023). Multi-label code-smell dataset [Dataset]. http://doi.org/10.6084/m9.figshare.24024591.v1
    Explore at:
    txt
    Available download formats
    Dataset updated
    Aug 24, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Binh Nguyen Thanh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A multi-label code-smell dataset for studies related to multi-label classification.

  6. Dataset: An Open Combinatorial Diffraction Dataset Including Consensus Human...

    • catalog.data.gov
    • s.cnmilf.com
    • +1 more
    Updated Sep 30, 2025
    + more versions
    Cite
    National Institute of Standards and Technology (2025). Dataset: An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models [Dataset]. https://catalog.data.gov/dataset/dataset-an-open-combinatorial-diffraction-dataset-including-consensus-human-and-machine-le
    Explore at:
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    The open dataset, software, and other files accompanying the manuscript "An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models," submitted for publication to Integrated Materials and Manufacturing Innovations.

    Machine learning and autonomy are increasingly prevalent in materials science, but existing models are often trained or tuned using idealized data as absolute ground truths. In actual materials science, "ground truth" is often a matter of interpretation and is more readily determined by consensus. Here we present the data, software, and other files for a study using as-obtained diffraction data as a test case for evaluating the performance of machine learning models in the presence of differing expert opinions. We demonstrate that experts with similar backgrounds can disagree greatly even for something as intuitive as using diffraction to identify the start and end of a phase transformation. We then use a logarithmic likelihood method to evaluate the performance of machine learning models in relation to the consensus expert labels and their variance. We further illustrate this method's efficacy in ranking a number of state-of-the-art phase mapping algorithms. We propose a materials data challenge centered around the problem of evaluating models based on consensus with uncertainty. The data, labels, and code used in this study are all available online at data.gov, and the interested reader is encouraged to replicate and improve the existing models or to propose alternative methods for evaluating algorithmic performance.

  7. Color Label Dataset

    • universe.roboflow.com
    zip
    Updated May 22, 2022
    Cite
    deep learning (2022). Color Label Dataset [Dataset]. https://universe.roboflow.com/deep-learning-dcksd/color-label
    Explore at:
    zip
    Available download formats
    Dataset updated
    May 22, 2022
    Dataset authored and provided by
    deep learning
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Color Bottle Bounding Boxes
    Description

    Color Label

    ## Overview
    
    Color Label is a dataset for object detection tasks - it contains Color Bottle annotations for 2,258 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  8. multi-label-web-categorization

    • huggingface.co
    Cite
    Taimur, multi-label-web-categorization [Dataset]. https://huggingface.co/datasets/tshasan/multi-label-web-categorization
    Explore at:
    Authors
    Taimur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Multi-Label Web Page Classification Dataset

      Dataset Description
    

    The Multi-Label Web Page Classification Dataset is a curated dataset containing web page titles and snippets, extracted from the CC-Meta25-1M dataset. Each entry has been automatically categorized into multiple predefined categories using ChatGPT-4o-mini. This dataset is designed for multi-label text classification tasks, making it ideal for training and evaluating machine learning models in web content… See the full description on the dataset page: https://huggingface.co/datasets/tshasan/multi-label-web-categorization.

  9. UCI and OpenML Data Sets for Ordinal Quantification

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Jul 25, 2023
    + more versions
    Cite
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
    Explore at:
    zip
    Available download formats
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
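    To make the quantification goal concrete, the sketch below estimates a label distribution on an unlabeled set via "classify and count", a simple baseline (an assumption on my part; it is not necessarily the method the dataset authors use). The synthetic data and classifier choice are placeholders.

```python
# Baseline quantification: predict labels, then count their relative frequencies.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=600, n_classes=3, n_informative=5, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X[:400], y[:400])

preds = clf.predict(X[400:])                         # treat the rest as unlabeled
est = np.bincount(preds, minlength=3) / len(preds)   # estimated class prevalences
print(est.round(2))                                  # prevalences sum to ~1
```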

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  10. 3D Microvascular Image Data and Labels for Machine Learning

    • rdr.ucl.ac.uk
    • datasetcatalog.nlm.nih.gov
    bin
    Updated Apr 30, 2024
    Cite
    Natalie Holroyd; Claire Walsh; Emmeline Brown; Emma Brown; Yuxin Zhang; Carles Bosch Pinol; Simon Walker-Samuel (2024). 3D Microvascular Image Data and Labels for Machine Learning [Dataset]. http://doi.org/10.5522/04/25715604.v1
    Explore at:
    bin
    Available download formats
    Dataset updated
    Apr 30, 2024
    Dataset provided by
    University College London
    Authors
    Natalie Holroyd; Claire Walsh; Emmeline Brown; Emma Brown; Yuxin Zhang; Carles Bosch Pinol; Simon Walker-Samuel
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    These images and associated binary labels were collected from collaborators across multiple universities to serve as a diverse representation of biomedical images of vessel structures, for use in training and validating machine learning tools for vessel segmentation. The dataset contains images from a variety of imaging modalities, at different resolutions, using different sources of contrast and featuring different organs/pathologies. This data was used to train, test and validate a foundational model for 3D vessel segmentation, tUbeNet, which can be found on GitHub. The paper describing the training and validation of the model can be found here.

    Filenames are structured as follows:
    Data - [Modality]_[species Organ]_[resolution].tif
    Labels - [Modality]_[species Organ]_[resolution]_labels.tif
    Sub-volumes of larger dataset - [Modality]_[species Organ]_subvolume[dimensions in pixels].tif

    Manual labelling of blood vessels was carried out using Amira (2020.2, Thermo-Fisher, UK).

    Training data:
    - opticalHREM_murineLiver_2.26x2.26x1.75um.tif: a high resolution episcopic microscopy (HREM) dataset, acquired in house by staining a healthy mouse liver with Eosin B and imaging with a standard HREM protocol. NB: 25% of this image volume was withheld from training, for use as test data.
    - CT_murineTumour_20x20x20um.tif: X-ray microCT images of a microvascular cast, taken from a subcutaneous mouse model of colorectal cancer (acquired in house). NB: 25% of this image volume was withheld from training, for use as test data.
    - RSOM_murineTumour_20x20um.tif: Raster-Scanning Optoacoustic Mesoscopy (RSOM) data from a subcutaneous tumour model (provided by Emma Brown, Bohndiek Group, University of Cambridge). The image data has undergone filtering to reduce the background (Brown et al., 2019).
    - OCTA_humanRetina_24x24um.tif: retinal angiography data obtained using Optical Coherence Tomography Angiography (OCT-A) (provided by Dr Ranjan Rajendram, Moorfields Eye Hospital).

    Test data:
    - MRI_porcineLiver_0.9x0.9x5mm.tif: T1-weighted Balanced Turbo Field Echo Magnetic Resonance Imaging (MRI) data from a machine-perfused porcine liver, acquired in house.
    - MFHREM_murineTumourLectin_2.76x2.76x2.61um.tif: a subcutaneous colorectal tumour mouse model, imaged in house using Multi-fluorescence HREM, with Dylight 647 conjugated lectin staining the vasculature (Walsh et al., 2021). The image data has been processed using an asymmetric deconvolution algorithm described by Walsh et al., 2020. NB: a sub-volume of 480x480x640 voxels was manually labelled (MFHREM_murineTumourLectin_subvolume480x480x640.tif).
    - MFHREM_murineBrainLectin_0.85x0.85x0.86um.tif: an MF-HREM image of the cortex of a mouse brain, stained with Dylight-647 conjugated lectin, acquired in house (Walsh et al., 2021). The image data has been downsampled and processed using an asymmetric deconvolution algorithm described by Walsh et al., 2020. NB: a sub-volume of 1000x1000x99 voxels was manually labelled; this sub-volume is provided at full resolution and without preprocessing (MFHREM_murineBrainLectin_subvol_0.57x0.57x0.86um.tif).
    - 2Photon_murineOlfactoryBulbLectin_0.2x0.46x5.2um.tif: two-photon data of mouse olfactory bulb blood vessels, labelled with sulforhodamine 101, kindly provided by Yuxin Zhang at the Sensory Circuits and Neurotechnology Lab, the Francis Crick Institute (Bosch et al., 2022). NB: a sub-volume of 500x500x79 voxels was manually labelled (2Photon_murineOlfactoryBulbLectin_subvolume500x500x79.tif).

    References:
    Bosch, C., Ackels, T., Pacureanu, A., Zhang, Y., Peddie, C. J., Berning, M., Rzepka, N., Zdora, M. C., Whiteley, I., Storm, M., Bonnin, A., Rau, C., Margrie, T., Collinson, L., & Schaefer, A. T. (2022). Functional and multiscale 3D structural investigation of brain tissue through correlative in vivo physiology, synchrotron microtomography and volume electron microscopy. Nature Communications, 13(1), 1-16. https://doi.org/10.1038/s41467-022-30199-6
    Brown, E., Brunker, J., & Bohndiek, S. E. (2019). Photoacoustic imaging as a tool to probe the tumour microenvironment. DMM Disease Models and Mechanisms, 12(7). https://doi.org/10.1242/DMM.039636
    Walsh, C., Holroyd, N. A., Finnerty, E., Ryan, S. G., Sweeney, P. W., Shipley, R. J., & Walker-Samuel, S. (2021). Multifluorescence High-Resolution Episcopic Microscopy for 3D Imaging of Adult Murine Organs. Advanced Photonics Research, 2(10), 2100110. https://doi.org/10.1002/ADPR.202100110
    Walsh, C., Holroyd, N., Shipley, R., & Walker-Samuel, S. (2020). Asymmetric Point Spread Function Estimation and Deconvolution for Serial-Sectioning Block-Face Imaging. Communications in Computer and Information Science, 1248 CCIS, 235-249. https://doi.org/10.1007/978-3-030-52791-4_19

  11. MAAD: Multi-Label Arabic Articles Dataset

    • data.mendeley.com
    Updated Oct 27, 2025
    Cite
    Marwah Yahya Al-Nahari (2025). MAAD : Multi-Label Arabic Articles Dataset [Dataset]. http://doi.org/10.17632/hbfc9j8hj8.2
    Explore at:
    Dataset updated
    Oct 27, 2025
    Authors
    Marwah Yahya Al-Nahari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The MAAD dataset is a comprehensive collection of Arabic news articles that can be employed across a diverse array of Arabic Natural Language Processing (NLP) tasks, including classification, text generation, summarization, and others. The dataset was assembled using purpose-built Python scripts targeting six prominent news platforms: Al Jazeera, BBC Arabic, Youm7, Russia Today, and Al Ummah, together with regional and local media outlets, resulting in a total of 602,792 articles. The dataset has a total word count of 29,371,439, with 296,518 unique words; the average word length is 6.36 characters and the mean article length is 736.09 characters. The articles are categorized into ten distinct classes: Political, Economic, Cultural, Arts, Sports, Health, Technology, Community, Incidents, and Local. Each record has five fields: Title, Article, Summary, Category, and Published_Date. MAAD is structured into six files, each named after the news outlet from which the data was sourced; each directory provides txt files covering the categories represented for that outlet. This dataset serves as an expansive standard resource designed for use in our research work.

  12. Data from: Machine Learning and Deep Learning Techniques for Colocated MIMO...

    • figshare.com
    bin
    Updated Mar 11, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Alessandro Davoli; Giorgio Guerzoni; Giorgio Matteo Vitetta (2025). Machine Learning and Deep Learning Techniques for Colocated MIMO Radars: A Tutorial Overview [Dataset]. http://doi.org/10.6084/m9.figshare.28574234.v1
    Explore at:
    bin
    Available download formats
    Dataset updated
    Mar 11, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Alessandro Davoli; Giorgio Guerzoni; Giorgio Matteo Vitetta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Last update: February 2021.

    The dataset folder includes both raw and post-processed radar data used for training and testing the networks proposed in Sect. VIII of the article "Machine Learning and Deep Learning Techniques for Colocated MIMO Radars: A Tutorial Overview".

    The folder Human Activity Classification contains:
    - "Raw": 150 files acquired with our FMCW radar sensor, given inside the "doppler_dataset" zip folder; 50 for walking, 50 for jumping and 50 for running.
    - "Post_process", divided into:
      - "Machine Learning", including "dataset_ML_doppler_real_activities.mat", used for training and testing the SVM, K-NN and Adaboost described in Sect. VIII-A: the 150x4 matrix "X_meas" with the features described by eqs. (227)-(234), and the 150x1 vector of char "labels_py" containing the associated labels.
      - "Deep Learning", containing "dataset_DL_doppler_real_activities.mat": 150 structs of data, each associated with a specific activity and including:
        - the label associated to the considered activity;
        - the overall range variation from the beginning to the end of the motion, "delta_R";
        - the Range-Doppler map "RD_map";
        - the normalized spectrogram "SP_Norm";
        - the Cadence Velocity Diagram "CVD";
        - the period of the spectrogram "per";
        - the peaks associated to the greatest three cadence frequencies, "peaks_cad";
        - the three strongest cadence frequencies and their normalized version, "cad_freqs" and "cad_freqs_norm";
        - the strongest cadence frequency "c1";
        - the three velocity profiles associated to the three strongest cadence frequencies, "matr_vex".
    The spectrogram images (SP_Norm) contained in this dataset were used for training and testing the CNN in Sect. VIII-A.

    The folder Obstacle Detection contains:
    - "Raw": raw data acquired with our radar system and TOF camera in the presence of a multi-target or single-target scenario, given inside the "obst_detect_Raw_mat" zip folder. Note that each radar frame and each TOF camera image has its own time stamp; since they come from different sensors, they have to be synchronized.
    - "Post_process", divided into:
      - "Neural Net", containing "inputs_bis_1.mat" (the 32x1 feature vectors used for training and testing the feed-forward neural network described in Sect. VIII-B; see eqs. (243)-(251)) and "t_d_1.mat" (the associated 2x1 label vectors; see eq. (235)).
      - "Yolo v2", containing the folder "Dataset_YOLO" and the table "obj_dataset_tab": "Dataset_YOLO_v2" contains (inside the sub-folder "obj_dataset") the Range-Azimuth maps used for training the YOLO v2 network (see eqs. (257)-(258) and Fig. 30); "obj_dataset_tab" contains the path, bounding box and label associated with the Range-Azimuth maps (see eqs. (256)-(266)).

    Cite as: A. Davoli, G. Guerzoni and G. M. Vitetta, "Machine Learning and Deep Learning Techniques for Colocated MIMO Radars: A Tutorial Overview," in IEEE Access, vol. 9, pp. 33704-33755, 2021, doi: 10.1109/ACCESS.2021.3061424.
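    The .mat files described above can presumably be inspected with SciPy. The sketch below writes and reads a small stand-in file, since the real "dataset_ML_doppler_real_activities.mat" is not bundled here; the filename "demo.mat" and the 150x4 shape mirror the description but are assumptions for illustration.

```python
# Hedged sketch (not the authors' code): round-trip a MATLAB .mat file with SciPy.
import numpy as np
from scipy.io import savemat, loadmat

# Stand-in for a file such as dataset_ML_doppler_real_activities.mat,
# whose "X_meas" is described as a 150x4 feature matrix.
savemat("demo.mat", {"X_meas": np.zeros((150, 4))})
data = loadmat("demo.mat")
print(data["X_meas"].shape)  # (150, 4)
```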

  13. Data from: Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Nov 26, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess the status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered in developing V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and to apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From the available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.

  14. Multi-Label Datasets with Missing Values

    • data.niaid.nih.gov
    Updated Mar 19, 2023
    Cite
    Antonio F. L. Jacob Jr.; Fabrício A. do Carmo; Ádamo L. de Santana; Ewaldo Santana; Fábio M. F. Lobato (2023). Multi-Label Datasets with Missing Values [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7748932
    Explore at:
    Dataset updated
    Mar 19, 2023
    Dataset provided by
    UEMA
    UFOPA
    Fuji Electric Co. Ltd.
    Authors
    Antonio F. L. Jacob Jr.; Fabrício A. do Carmo; Ádamo L. de Santana; Ewaldo Santana; Fábio M. F. Lobato
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This collection consists of six multi-label datasets from the UCI Machine Learning Repository.

    Each dataset contains missing values which have been artificially added at the following rates: 5, 10, 15, 20, 25, and 30%. The “amputation” was performed using the “Missing Completely at Random” mechanism.
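    As a rough sketch, MCAR amputation at a given rate can be reproduced like this (the original authors' tooling is not specified, so the function and parameter names here are assumptions):

```python
import numpy as np

def amputate_mcar(X, rate, seed=0):
    """Set a fraction `rate` of entries in feature matrix X to NaN,
    choosing positions uniformly at random (Missing Completely at Random)."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    mask = rng.random(X.shape) < rate  # each cell goes missing independently
    X[mask] = np.nan
    return X

# One amputated copy per missing rate used in the collection
rates = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
```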

    File names are represented as follows:

       amp_DB_MR.arff
    

    where:

       DB = original dataset;
       MR = missing rate.
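    A file name following the convention above can be parsed like this (a small sketch; the dataset name "emotions" in the test below is hypothetical):

```python
import re

def parse_amp_filename(name):
    """Split an amputated-dataset file name of the form amp_DB_MR.arff
    into the original dataset name (DB) and the missing rate (MR, %)."""
    m = re.fullmatch(r"amp_(.+)_(\d+)\.arff", name)
    if m is None:
        raise ValueError(f"unexpected file name: {name}")
    return m.group(1), int(m.group(2))
```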
    

    For more details, please read:

    IEEE Access article (in review process)

  15. Data from: MLRSNet: A Multi-label High Spatial Resolution Remote Sensing Dataset for Semantic Scene Understanding

    • data.mendeley.com
    Updated Sep 18, 2023
    + more versions
    Cite
    Xiaoman Qi (2023). MLRSNet: A Multi-label High Spatial Resolution Remote Sensing Dataset for Semantic Scene Understanding [Dataset]. http://doi.org/10.17632/7j9bv9vwsx.4
    Explore at:
    Dataset updated
    Sep 18, 2023
    Authors
    Xiaoman Qi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MLRSNet provides different perspectives of the world captured from satellites. That is, it is composed of high spatial resolution optical satellite images. MLRSNet contains 109,161 remote sensing images that are annotated into 46 categories, and the number of sample images in a category varies from 1,500 to 3,000. The images have a fixed size of 256×256 pixels with various pixel resolutions (~10m to 0.1m). Moreover, each image in the dataset is tagged with several of 60 predefined class labels, and the number of labels associated with each image varies from 1 to 13. The dataset can be used for multi-label based image classification, multi-label based image retrieval, and image segmentation.

    The dataset includes:

    1. Images folder: 46 categories, 109,161 high-spatial-resolution remote sensing images.
    2. Labels folders: each category has a .csv file.
    3. Categories_names.xlsx: Sheet1 lists the names of the 46 categories, and Sheet2 shows the multi-labels associated with each category.
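    Given that layout, the per-category label files could be combined into a single table along these lines (a sketch; the exact CSV column contents are assumptions and should be checked against the release):

```python
import pandas as pd
from pathlib import Path

def load_mlrsnet_labels(labels_dir):
    """Concatenate the per-category label CSVs into one DataFrame.

    Assumes each CSV has an image-name column followed by one 0/1
    column per predefined label (60 labels in MLRSNet).
    """
    frames = [pd.read_csv(p) for p in sorted(Path(labels_dir).glob("*.csv"))]
    return pd.concat(frames, ignore_index=True)
```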

  16. RTAnews: A Benchmark for Multi-label Arabic Text Categorization

    • data.mendeley.com
    • semantichub.ijs.si
    Updated Aug 18, 2018
    Cite
    Bassam Al-Salemi (2018). RTAnews: A Benchmark for Multi-label Arabic Text Categorization [Dataset]. http://doi.org/10.17632/322pzsdxwy.1
    Explore at:
    Dataset updated
    Aug 18, 2018
    Authors
    Bassam Al-Salemi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The RTAnews dataset is a collection of multi-label Arabic texts, collected from the Russia Today in Arabic news portal. It consists of 23,837 texts (news articles) distributed over 40 categories and divided into 15,001 texts for training and 8,836 texts for testing.

    The original dataset (without preprocessing), a preprocessed version of the dataset, versions of the dataset in MEKA and Mulan formats, a single-label version, and a WEKA version are all available.

    For any enquiry or support regarding the dataset, please feel free to contact us via bassalemi at gmail dot com

  17. ArXiv CS Papers Multi-Label Classification (200K)

    • kaggle.com
    zip
    Updated Jun 7, 2023
    Cite
    Sharukh Rahman (2023). ArXiv CS Papers Multi-Label Classification (200K) [Dataset]. https://www.kaggle.com/datasets/devintheai/arxiv-cs-papers-multi-label-classification-200k-v1
    Explore at:
    zip(83841332 bytes)Available download formats
    Dataset updated
    Jun 7, 2023
    Authors
    Sharukh Rahman
    Description

    The ArXiv CS Papers Multi-Label Classification dataset is a comprehensive collection of research papers from the computer science domain. This dataset is intended for multi-label classification tasks and contains a diverse range of research papers spanning various topics within computer science.

    The dataset consists of more than 200,000 research papers and includes the following columns:

    • Paper ID: A unique identifier for each research paper in the dataset.
    • Title: The title of the research paper.
    • Abstract: A brief summary or abstract of the research paper.
    • Year: The publication year of the research paper.
    • Primary Category: The primary category of the research paper, representing the main topic or area of focus.
    • Categories: Additional categories or subtopics associated with the research paper.

    This dataset is well-suited for tasks related to text classification, topic modeling, information retrieval, and other natural language processing (NLP) tasks. Researchers and practitioners can leverage this dataset to develop and evaluate machine learning models for multi-label classification on a wide range of computer science topics.
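    For multi-label classification, the Categories field can be expanded into a binary indicator matrix. The sketch below assumes categories are space-separated strings such as "cs.LG cs.AI" (the arXiv convention); check the actual column format before relying on it:

```python
def to_multi_hot(category_strings):
    """Convert space-separated category strings (e.g. "cs.LG cs.AI")
    into a binary indicator matrix plus the sorted label vocabulary."""
    label_sets = [set(s.split()) for s in category_strings]
    vocab = sorted(set().union(*label_sets)) if label_sets else []
    rows = [[1 if lab in labs else 0 for lab in vocab] for labs in label_sets]
    return rows, vocab
```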

    Note: Please refer to the original ArXiv repository for access to the full-text content of the papers and proper citation guidelines. This dataset contains metadata and should be used for research and educational purposes only.

    We hope that the ArXiv CS Papers Multi-Label Classification dataset serves as a valuable resource for researchers, data scientists, and machine learning enthusiasts in their quest to advance knowledge and understanding in the field of computer science.

  18. BUTTER - Empirical Deep Learning Dataset

    • data.openei.org
    • datasets.ai
    • +2more
    code, data, website
    Updated May 20, 2022
    Cite
    Charles Tripp; Jordan Perr-Sauer; Lucas Hayne; Monte Lunacek; Charles Tripp; Jordan Perr-Sauer; Lucas Hayne; Monte Lunacek (2022). BUTTER - Empirical Deep Learning Dataset [Dataset]. http://doi.org/10.25984/1872441
    Explore at:
    code, website, dataAvailable download formats
    Dataset updated
    May 20, 2022
    Dataset provided by
    National Renewable Energy Laboratory
    Open Energy Data Initiative (OEDI)
    USDOE Office of Energy Efficiency and Renewable Energy (EERE), Multiple Programs (EE)
    Authors
    Charles Tripp; Jordan Perr-Sauer; Lucas Hayne; Monte Lunacek; Charles Tripp; Jordan Perr-Sauer; Lucas Hayne; Monte Lunacek
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The BUTTER Empirical Deep Learning Dataset represents an empirical study of deep learning phenomena on dense fully connected networks, scanning across thirteen datasets, eight network shapes, fourteen depths, twenty-three network sizes (number of trainable parameters), four learning rates, six minibatch sizes, four levels of label noise, and fourteen levels each of L1 and L2 regularization. Multiple repetitions (typically 30, sometimes 10) of each combination of hyperparameters were performed, and statistics including training and test loss (using an 80% / 20% shuffled train-test split) are recorded at the end of each training epoch. In total, this dataset covers 178 thousand distinct hyperparameter settings ("experiments"), 3.55 million individual training runs (an average of 20 repetitions of each experiment), and a total of 13.3 billion training epochs (three thousand epochs were covered by most runs). Accumulating this dataset consumed 5,448.4 CPU core-years, 17.8 GPU-years, and 111.2 node-years.

  19. Drug Labels & Side Effects Dataset | 1400+ Records

    • kaggle.com
    zip
    Updated Aug 2, 2025
    Cite
    Pratyush Puri (2025). Drug Labels & Side Effects Dataset | 1400+ Records [Dataset]. https://www.kaggle.com/datasets/pratyushpuri/drug-labels-and-side-effects-dataset-1400-records
    Explore at:
    zip(51886 bytes)Available download formats
    Dataset updated
    Aug 2, 2025
    Authors
    Pratyush Puri
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Drug Labels and Side Effects Dataset

    Dataset Overview

    This synthetic pharmaceutical dataset contains 1,393 records of drug information with 15 columns, designed for data science projects focusing on healthcare analytics, drug safety analysis, and pharmaceutical research. The dataset simulates real-world pharmaceutical data with appropriate variety and realistic constraints for machine learning applications.

    Dataset Specifications

    • Total Records: 1,393
    • Total Columns: 15
    • File Format: CSV
    • Data Types: Mixed (intentional, for data cleaning practice)
    • Domain: Pharmaceutical/Healthcare
    • Use Case: ML training, data analysis, healthcare research

    Column Specifications

    Categorical Features

    • drug_name (Object, 1,283 unique): pharmaceutical drug names with realistic naming patterns, e.g. "Loxozepam32", "Amoxparin43", "Virazepam10"
    • manufacturer (Object, 10 unique): major pharmaceutical companies, e.g. Pfizer Inc., AstraZeneca, Johnson & Johnson
    • drug_class (Object, 10 unique): therapeutic drug classifications, e.g. Antibiotic, Analgesic, Antidepressant, Vaccine
    • indications (Object, 10 unique): medical conditions the drug treats, e.g. "Pain relief", "Bacterial infections", "Depression treatment"
    • side_effects (Object, 434 unique): combination of side effects (1-3 per drug), e.g. "Nausea, Dizziness", "Headache, Fatigue, Rash"
    • administration_route (Object, 7 unique): method of drug delivery, e.g. Oral, Intravenous, Topical, Inhalation, Sublingual
    • contraindications (Object, 10 unique): medical warnings for drug usage, e.g. "Pregnancy", "Heart disease", "Liver disease"
    • warnings (Object, 10 unique): safety instructions and precautions, e.g. "Take with food", "Avoid alcohol", "Monitor blood pressure"
    • batch_number (Object, 1,393 unique): manufacturing batch identifiers, e.g. "xr691zv", "Ye266vU", "Rm082yX"
    • expiry_date (Object, 782 unique): drug expiration dates (YYYY-MM-DD), e.g. "2025-12-13", "2027-03-09", "2026-10-06"
    • side_effect_severity (Object, 3 unique): severity classification: Mild, Moderate, Severe
    • approval_status (Object, 3 unique): regulatory approval status: Approved, Pending, Rejected

    Numerical Features

    • approval_year (Float/String*): range 1990-2024, mean 2006.7, std 10.0. FDA/regulatory approval year.
    • dosage_mg (Float/String*): range 10-990 mg, mean 499.7, std 290.0. Medication strength in milligrams.
    • price_usd (Float/String*): range $2.32-$499.24, mean $251.12, std $144.81. Drug price in US dollars.

    *Intentionally stored as mixed types for data cleaning practice
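    A first cleaning pass over those intentionally mixed-type columns might look like the following sketch (column names are taken from the tables above; the specific string decorations handled, such as a trailing "mg" or a leading "$", are assumptions):

```python
import pandas as pd

def coerce_numeric(df, cols=("approval_year", "dosage_mg", "price_usd")):
    """Coerce intentionally mixed-type columns to numeric.

    Common unit/currency decorations are stripped first; anything still
    unparseable becomes NaN for downstream imputation."""
    df = df.copy()
    for col in cols:
        cleaned = (df[col].astype(str)
                   .str.replace(r"[$,]", "", regex=True)     # drop "$" and thousands separators
                   .str.replace(r"\s*mg$", "", regex=True))  # drop a trailing "mg" unit
        df[col] = pd.to_numeric(cleaned, errors="coerce")
    return df
```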

    Key Statistics

    Manufacturer Distribution

    • Pfizer Inc.: 170 (12.2%)
    • AstraZeneca: ~140 (~10.0%)
    • Merck & Co.: ~140 (~10.0%)
    • Johnson & Johnson: ~140 (~10.0%)
    • GlaxoSmithKline: ~140 (~10.0%)
    • Others: ~623 (~44.8%)

    Drug Class Distribution

    • Anti-inflammatory: 154 (most common)
    • Antibiotic: ~140
    • Antidepressant: ~140
    • Antiviral: ~140
    • Vaccine: ~140
    • Others: ~679

    Side Effect Severity

    • Severe: 488 (35.0%)
    • Moderate: ~453 (~32.5%)
    • Mild: ~452 (~32.5%)

    Potential Use Cases

    1. Machine Learning Applications

    • Drug Approval Prediction: Predict approval likelihood based on drug characteristics
    • Price Prediction: Estimate drug pricing using features like class, manufacturer, dosage
    • Side Effect Classification: Classify severity based on drug properties
    • Market Success Analysis: Analyze factors contributing to drug market performance

    2. Data Engineering Projects

    • ETL Pipeline Development: Practice data cleaning and transformation
    • Data Quality Assessment: Implement data validation and quality checks
    • Database Design: Create normalized pharmaceutical database schema
    • Real-time Processing: Stream processing for drug monitoring systems

    3. Business Intelligence

    • Pharmaceutical Market Analysis: Manufacturer market share and competitive analysis
    • Drug Safety Analytics: Side effect patterns and safety profile analysis
    • Regulatory Compliance: Approval trends and regulatory timeline analysis
    • Pricing Strategy: Competitive pricing analysis across drug classes

    Recommended Next Steps

    1. Data Cleaning Pipeline: Implement comprehe...
  20. Physiological signals during activities for daily life: Dataset

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Mar 29, 2022
    Cite
    Eduardo Gutierrez Maestro; Eduardo Gutierrez Maestro (2022). Physiological signals during activities for daily life: Dataset [Dataset]. http://doi.org/10.5281/zenodo.6391454
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 29, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Eduardo Gutierrez Maestro; Eduardo Gutierrez Maestro
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in this work comprises data from four participants, two men and two women. Each of them wore the Empatica E4 wearable device for a total of 15 days. They wore the device during the day, and at night we asked participants to charge it and load the data onto an external memory unit. During these days, participants were asked to answer EMA (ecological momentary assessment) questionnaires, which are used to label the data. However, some participants could not complete the full experiment, and some days were discarded due to data corruption. Specific demographic information, total sampling days, and the total number of EMA answers can be found in Table I.

    Participant 1: Age 67, Male, 9 final valid days, 42 EMA answers
    Participant 2: Age 55, Female, 15 final valid days, 57 EMA answers
    Participant 3: Age 60, Male, 12 final valid days, 64 EMA answers
    Participant 4: Age 63, Female, 13 final valid days, 46 EMA answers

    Table I. Summary of participants' collected data.

    This dataset provides three different types of labels. Activeness and happiness are two of them: the answers to the EMA questionnaires that participants reported during their daily activities, as integers between 0 and 4. These labels are used to interpolate the mental well-being state according to [1]. We report in our dataset a total of eight emotional states: (1) pleasure, (2) excitement, (3) arousal, (4) distress, (5) misery, (6) depression, (7) sleepiness, and (8) contentment.

    The data we provide in this repository consist of two type of files:

    • CSV files: These files contain the physiological signals recorded during data collection. The first line of each CSV file gives the timestamp at which sampling started, and the second line gives the sampling frequency of the signal. From the third line to the end of the file are the sampled datapoints.
    • Excel files: These files contain the labels obtained from the EMA answers, together with the timestamp at which each answer was registered. Labels for pleasure, activeness, and mood can be found in these files.

    NOTE: Files are numbered according to the specific sampling day. For example, ACC1.csv corresponds to the ACC signal for sampling day 1. The same applies to the Excel files.
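    Based on that description (which matches the standard Empatica E4 export layout), a signal CSV can be read as follows; this is a sketch, not code from the linked repository:

```python
import csv

def read_e4_csv(path):
    """Read an Empatica-style signal CSV: first row = start timestamp,
    second row = sampling frequency (Hz), remaining rows = samples."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    start_ts = float(rows[0][0])
    freq_hz = float(rows[1][0])
    samples = [[float(v) for v in row] for row in rows[2:]]
    return start_ts, freq_hz, samples
```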

    Code and a tutorial on how to label the data and extract features can be found in this repository: https://github.com/edugm94/temporal-feat-emotion-prediction

    References:

    [1] J. A. Russell, “A circumplex model of affect,” Journal of Personality and Social Psychology, vol. 39, no. 6, p. 1161, 1980.
