41 datasets found
  1. Unlabeled AnuraSet: A dataset for leveraging unlabeled data in machine...

    • zenodo.org
    • data.niaid.nih.gov
    txt, zip
    Updated May 27, 2024
    Cite
    Juan Sebastián Cañas; Toro-Gómez María Paula; Moreira Sugai Larissa Sayuri; Luis Felipe Toledo; De Souza Franco Leandro; Neckel De Oliveira Selvino; Pereira Bastos Rogerio; Llusia Diego; Ulloa Juan Sebastián (2024). Unlabeled AnuraSet: A dataset for leveraging unlabeled data in machine learning models for passive acoustic monitoring [Dataset]. http://doi.org/10.5281/zenodo.11244814
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juan Sebastián Cañas; Toro-Gómez María Paula; Moreira Sugai Larissa Sayuri; Luis Felipe Toledo; De Souza Franco Leandro; Neckel De Oliveira Selvino; Pereira Bastos Rogerio; Llusia Diego; Ulloa Juan Sebastián
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Unlabeled AnuraSet (U-AnuraSet) is an extension of the original AnuraSet dataset. It consists of soundscape recordings from passive acoustic monitoring conducted in Brazil. The recording sites are identical to those in the original AnuraSet. Each site comprises 2,666 one-minute raw audio files of unlabeled data. The U-AnuraSet is publicly available to encourage machine learning researchers to explore innovative methods for leveraging unlabeled data in the training of models aimed at solving problems such as anuran call identification.

    If you find the Unlabeled AnuraSet useful for your research, please consider citing it as follows:

    Cañas, J.S., Toro-Gómez, M.P., Sugai, L.S.M., et al. A dataset for benchmarking Neotropical anuran calls identification in passive acoustic monitoring. Sci Data 10, 771 (2023). https://doi.org/10.1038/s41597-023-02666-2

  2. Explanations for each cluster in Iris dataset.

    • plos.figshare.com
    xls
    Updated Oct 27, 2023
    Cite
    Liang Chen; Caiming Zhong; Zehua Zhang (2023). Explanations for each cluster in Iris dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0292960.t003
    Dataset provided by
    PLOS ONE
    Authors
    Liang Chen; Caiming Zhong; Zehua Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Clustering is an unsupervised machine learning technique whose goal is to group unlabeled data. However, traditional clustering methods only output a set of results and do not provide any explanations of the results. Although a number of methods based on decision trees have been proposed in the literature to explain clustering results, most of them have disadvantages, such as too many branches and overly deep leaves, which lead to complex explanations that are difficult for users to understand. In this paper, a hypercube overlay model based on multi-objective optimization is proposed to achieve succinct explanations of clustering results. The model designs two objective functions based on the number of hypercubes and the compactness of instances and then uses multi-objective optimization to find a set of nondominated solutions. Finally, a Utopia point is defined to determine the most suitable solution, in which each cluster can be covered by as few hypercubes as possible. Based on these hypercubes, an explanation of each cluster is provided. Verification on synthetic and real datasets shows that the model can provide concise and understandable explanations to users.
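
    The selection step described in this abstract (filter nondominated solutions, then pick the one nearest a Utopia point) can be illustrated with a small sketch. It is not the authors' implementation: it assumes both objectives are minimized, that the Utopia point is the per-objective minimum over the front, and that closeness is Euclidean distance after min-max scaling. The candidate values are made up.

```python
import math

def nondominated(solutions):
    """Keep solutions that no other solution dominates (all objectives minimized)."""
    front = []
    for i, a in enumerate(solutions):
        dominated = any(
            all(bk <= ak for ak, bk in zip(a, b)) and any(bk < ak for ak, bk in zip(a, b))
            for j, b in enumerate(solutions) if j != i
        )
        if not dominated:
            front.append(a)
    return front

def closest_to_utopia(front):
    """Pick the front member nearest to the utopia point after min-max scaling."""
    dims = range(len(front[0]))
    lo = [min(s[d] for s in front) for d in dims]  # assumed utopia point: best value per objective
    hi = [max(s[d] for s in front) for d in dims]

    def dist(s):
        return math.sqrt(sum(
            ((s[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0) ** 2
            for d in dims))

    return min(front, key=dist)

# Hypothetical candidates: (number of hypercubes, compactness cost)
candidates = [(2, 9.0), (3, 5.0), (4, 4.5), (6, 4.0), (5, 6.0), (3, 7.0)]
front = nondominated(candidates)
best = closest_to_utopia(front)
```

    With these toy values the front keeps four candidates and the compromise chosen is the one with three hypercubes.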

  3. Mal-Activity

    • opendatalab.com
    • paperswithcode.com
    zip
    Updated Mar 17, 2023
    Cite
    University of New South Wales (2023). Mal-Activity [Dataset]. https://opendatalab.com/OpenDataLab/Mal-Activity
    Dataset provided by
    Nokia Bell Labs
    University of New South Wales
    Macquarie University
    University of Sydney
    Description

    This is a dataset of Internet malicious activity (mal-activity for short). It contains more than 51 million mal-activity reports involving 662K unique IP addresses, covering the period from January 2007 to June 2017. Leveraging the Wayback Machine, antivirus (AV) tool reports, and several additional public datasets (e.g., BGP Route Views and Internet registries), the data is enriched with historical meta-information including geo-locations (countries), autonomous system (AS) numbers, and types of mal-activity. An initially labelled dataset of approximately 1.57 million mal-activities (obtained from public blacklists) is used to train a machine learning classifier to classify the remaining unlabeled dataset of approximately 44 million mal-activities obtained through additional sources.

  4. Setting of parameters.

    • plos.figshare.com
    xls
    Updated Oct 27, 2023
    Cite
    Liang Chen; Caiming Zhong; Zehua Zhang (2023). Setting of parameters. [Dataset]. http://doi.org/10.1371/journal.pone.0292960.t002
    Dataset provided by
    PLOS ONE
    Authors
    Liang Chen; Caiming Zhong; Zehua Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Clustering is an unsupervised machine learning technique whose goal is to group unlabeled data. However, traditional clustering methods only output a set of results and do not provide any explanations of the results. Although a number of methods based on decision trees have been proposed in the literature to explain clustering results, most of them have disadvantages, such as too many branches and overly deep leaves, which lead to complex explanations that are difficult for users to understand. In this paper, a hypercube overlay model based on multi-objective optimization is proposed to achieve succinct explanations of clustering results. The model designs two objective functions based on the number of hypercubes and the compactness of instances and then uses multi-objective optimization to find a set of nondominated solutions. Finally, a Utopia point is defined to determine the most suitable solution, in which each cluster can be covered by as few hypercubes as possible. Based on these hypercubes, an explanation of each cluster is provided. Verification on synthetic and real datasets shows that the model can provide concise and understandable explanations to users.

  5. UCI and OpenML Data Sets for Ordinal Quantification

    • zenodo.org
    zip
    Updated Jul 25, 2023
    Cite
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
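
    To make the two protocols concrete, here is a minimal sketch of APP-style sampling followed by an APP-OQ-style filter. It is an illustration, not the authors' extraction code: sampling with replacement, the toy labels, the sample sizes, and the roughness measure (sum of squared second differences) are all assumptions.

```python
import random

def sample_prevalences(n_classes, rng):
    """Draw a label distribution uniformly from the probability simplex."""
    # Normalised i.i.d. Exp(1) draws follow a flat Dirichlet(1, ..., 1).
    draws = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]

def draw_app_sample(labels, sample_size, rng):
    """Return (indices, prevalences) for one sample, drawn with replacement."""
    classes = sorted(set(labels))
    by_class = {c: [i for i, y in enumerate(labels) if y == c] for c in classes}
    prevalences = sample_prevalences(len(classes), rng)
    # Turn prevalences into per-class counts that sum to sample_size.
    counts = [int(p * sample_size) for p in prevalences]
    remainders = [p * sample_size - c for p, c in zip(prevalences, counts)]
    missing = sample_size - sum(counts)
    for k in sorted(range(len(classes)), key=lambda k: -remainders[k])[:missing]:
        counts[k] += 1
    indices = []
    for c, n in zip(classes, counts):
        indices.extend(rng.choices(by_class[c], k=n))
    return indices, prevalences

def roughness(prevalences):
    """Sum of squared second differences; smaller means smoother (assumed measure)."""
    return sum((prevalences[i - 1] - 2 * prevalences[i] + prevalences[i + 1]) ** 2
               for i in range(1, len(prevalences) - 1))

rng = random.Random(0)
labels = [i % 5 for i in range(500)]  # toy ordinal labels 0..4
samples = [draw_app_sample(labels, 100, rng) for _ in range(50)]
# APP-OQ-style filter: keep the smoothest 20% of the APP samples.
smoothest = sorted(samples, key=lambda s: roughness(s[1]))[: len(samples) // 5]
```

    Each entry of `samples` pairs a list of item indices with the label distribution it was drawn from, which mirrors the idea of one sample per row in the index files.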

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
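
    As a hedged illustration of working with such a file, the sketch below reads a CSV with a header row and computes the relative frequency of each "class_label" value. The feature column names and the toy rows are invented; only the "class_label" column name comes from the description above.

```python
import csv
import io
from collections import Counter

def label_distribution(csv_file):
    """Return {class_label: relative frequency} from a CSV with a header row."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row["class_label"] for row in reader)
    total = sum(counts.values())
    return {label: n / total for label, n in sorted(counts.items())}

# Stand-in for one extracted file; the feature columns here are made up.
toy_csv = io.StringIO(
    "class_label,feature_a,feature_b\n"
    "1,0.2,3.1\n"
    "2,0.4,2.7\n"
    "2,0.1,3.3\n"
    "3,0.9,1.8\n"
)
distribution = label_distribution(toy_csv)
```

    For a real extracted file, the same function can be given a handle from `open(path, newline="")`, where `path` is whatever file the extraction produced.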

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  6. Data from: Exploring deep learning techniques for wild animal behaviour...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated Feb 22, 2024
    Cite
    Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa (2024). Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. http://doi.org/10.5061/dryad.2ngf1vhwk
    Dataset provided by
    Osaka University
    Nagoya University
    Authors
    Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. 

    Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.

    This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see the README for details of the datasets.
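
    The random data augmentation mentioned in the abstract (one technique chosen at random for each window during mini-batch training) can be sketched as follows. This is a simplified illustration, not the authors' code: it covers only four of the six listed options (none, scaling, jittering, permutation), and the noise levels, segment count, and toy windows are assumptions.

```python
import random

def identity(window, rng):
    """The 'none' option: return an unchanged copy."""
    return [list(sample) for sample in window]

def scaling(window, rng, sigma=0.1):
    """Multiply each axis by a random factor close to 1."""
    factors = [rng.gauss(1.0, sigma) for _ in window[0]]
    return [[v * f for v, f in zip(sample, factors)] for sample in window]

def jittering(window, rng, sigma=0.05):
    """Add independent Gaussian noise to every value."""
    return [[v + rng.gauss(0.0, sigma) for v in sample] for sample in window]

def permutation(window, rng, n_segments=4):
    """Split the window into segments and shuffle their order."""
    seg_len = len(window) // n_segments
    segments = [window[i * seg_len:(i + 1) * seg_len] for i in range(n_segments - 1)]
    segments.append(window[(n_segments - 1) * seg_len:])  # last segment takes the remainder
    rng.shuffle(segments)
    return [list(sample) for seg in segments for sample in seg]

AUGMENTATIONS = [identity, scaling, jittering, permutation]

def augment_batch(batch, rng):
    """Apply one randomly chosen augmentation to each window in a mini-batch."""
    return [rng.choice(AUGMENTATIONS)(window, rng) for window in batch]

rng = random.Random(42)
# Toy mini-batch: 8 windows of 50 time steps with 3 acceleration axes each.
batch = [[[0.01 * t, 1.0, -0.5] for t in range(50)] for _ in range(8)]
augmented = augment_batch(batch, rng)
```

    Every augmentation keeps the window shape, so the augmented batch can stand in for the original one in a training loop.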

  7. Data and code from: Learning a deep language model for microbiomes: The...

    • search.dataone.org
    • data.niaid.nih.gov
    • +2more
    Updated Feb 26, 2025
    Cite
    Quintin Pope; Rohan Varma; Christine Tataru; Maude David; Xiaoli Fern (2025). Data and code from: Learning a deep language model for microbiomes: The power of large scale unlabeled microbiome data [Dataset]. https://search.dataone.org/view/sha256%3A8c00b06c01187eb7b0df45066ab9152b765faa96c82b52a4ce6e18de1632a2b9
    Dataset provided by
    Dryad Digital Repository
    Authors
    Quintin Pope; Rohan Varma; Christine Tataru; Maude David; Xiaoli Fern
    Description

    We use open source human gut microbiome data to learn a microbial “language” model by adapting techniques from Natural Language Processing (NLP). Our microbial “language” model is trained in a self-supervised fashion (i.e., without additional external labels) to capture the interactions among different microbial taxa and the common compositional patterns in microbial communities. The learned model produces contextualized taxon representations that allow a single microbial taxon to be represented differently according to the specific microbial environment in which it appears. The model further provides a sample representation by collectively interpreting different microbial taxa in the sample and their interactions as a whole. We demonstrate that, while our sample representation performs comparably to baseline models in in-domain prediction tasks such as predicting Irritable Bowel Disease (IBD) and diet patterns, it significantly outperforms them when generalizing to test data from indep...

    No additional raw data was collected for this project. All inputs are available publicly. American Gut Project, Halfvarson, and Schirmer raw data are available from the NCBI database (accession numbers PRJEB11419, PRJEB18471, and PRJNA398089, respectively). We used the curated data produced by Tataru and David, 2020.

    Code and data for "Learning a deep language model for microbiomes: the power of large scale unlabeled microbiome data"

    Data:

    • vocab_embeddings.npy
    • microbiomedata.zip
      • Contains the labels and data for the three datasets used in this study. Specifically, it includes:
      • IBD_(test|train)_(512|otu).npy and IBD_(test|train)_labels.npy
      • halfvarson_(512_otu|otu).npy and halfvarson_IBD_labels.npy
      • schirmer_IBD_(512_otu|otu).npy and schirmer_IBD_labels.npy
      • (test|train)encodings_(512|1897).npy
      • The data are stored as n_samples x max_sample_size x 2 numpy arrays, containing both the vocab IDs of the taxa in the ...
  8. Product Reviews for Ordinal Quantification

    • zenodo.org
    zip
    Updated Oct 4, 2023
    Cite
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). Product Reviews for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.7081208
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
    Description

    This data set comprises a labeled training set, validation samples, and testing samples for ordinal quantification. It appears in our research paper "Ordinal Quantification Through Regularization", which we have published at ECML-PKDD 2022.

    The data is extracted from the McAuley data set of product reviews in Amazon, where the goal is to predict the 5-star rating of each textual review. We have sampled this data according to two protocols that are suited for quantification research. The goal of quantification is not to predict the star rating of each individual instance, but the distribution of ratings in sets of textual reviews. More generally speaking, quantification aims at estimating the distribution of labels in unlabeled samples of data.
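
    The difference between predicting individual ratings and estimating their distribution can be shown with the simplest quantification baseline, usually called classify and count. This sketch is only an illustration of the task; it is not claimed to be the method evaluated with this data set, and the predictions are invented.

```python
from collections import Counter

def classify_and_count(predicted_ratings, classes=(1, 2, 3, 4, 5)):
    """Estimate the rating distribution of a sample from per-review predictions."""
    counts = Counter(predicted_ratings)
    n = len(predicted_ratings)
    return [counts[c] / n for c in classes]

# Hypothetical predictions of some classifier for one sample of ten reviews.
predictions = [5, 4, 5, 3, 5, 1, 4, 5, 2, 4]
estimate = classify_and_count(predictions)
```

    The output is a single vector of five prevalences for the whole sample, which is the quantity a quantifier is evaluated on, rather than ten individual labels.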

    The first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification, where classes are ordered and a similarity of neighboring classes can be assumed. 5-star ratings of product reviews lie on an ordinal scale and, hence, pose such an ordinal quantification task.

    This data set comprises two representations of the McAuley data. The first representation consists of TF-IDF features. The second representation is a RoBERTa embedding. This second representation is dense, while the first is sparse. In our experience, logistic regression classifiers work well with both representations. RoBERTa embeddings yield more accurate predictors than the TF-IDF features.

    You can extract our data sets yourself, for instance, if you require a raw textual representation. The original McAuley data set is public already and we provide all of our extraction scripts.

    Extraction scripts and experiments: https://github.com/mirkobunse/ecml22

    Original data by McAuley: https://jmcauley.ucsd.edu/data/amazon/

  9. Data from: Exploring deep learning techniques for wild animal behaviour...

    • data.niaid.nih.gov
    Updated Jan 23, 2024
    Cite
    Maekawa, Takuya (2024). Data from: Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10557258
    Dataset provided by
    Otsuka, Ryoma
    Mizutani, Yuichi
    Maekawa, Takuya
    Tanigaki, Kei
    Koyama, Shiho
    Yoda, Ken
    Yoshimura, Naoya
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    1: Machine learning-based behaviour classification using acceleration data is a powerful tool in bio-logging research. Deep learning architectures such as convolutional neural networks (CNN), long short-term memory (LSTM), and self-attention mechanism as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration-based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached, and complexity in data due to complex animal-specific behaviours, which may have limited the application of deep learning techniques in this area.

    2: To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup, and pre-training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state-of-the-art deep learning model architectures.

    3: Data augmentation improved the overall model performance when one of various techniques (none, scaling, jittering, permutation, time-warping, and rotation) was randomly applied to each data during mini-batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre-training with unlabelled data did not improve model performance. The state-of-the-art deep learning models, including a model consisting of four CNN layers, an LSTM layer, and a multi-head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features.

    4: Our experiments showed that deep learning techniques are promising for acceleration-based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning, and self-supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time-series sensor data.

    This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see the README for details of the datasets.
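
    The core operation behind manifold mixup, named in the abstract, is a convex combination of two hidden representations and of their labels using one shared weight. The sketch below shows only that mixing step on stand-in vectors. The Beta parameter, the vectors, and the two-class labels are assumptions, and in a real model the mixing would happen at a hidden layer inside the network.

```python
import random

def mixup(hidden_a, hidden_b, label_a, label_b, rng, alpha=2.0):
    """Interpolate two hidden vectors and their one-hot labels with one weight."""
    lam = rng.betavariate(alpha, alpha)
    mixed_hidden = [lam * a + (1 - lam) * b for a, b in zip(hidden_a, hidden_b)]
    mixed_label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_hidden, mixed_label, lam

rng = random.Random(1)
h1, y1 = [0.2, 1.5, -0.3], [1.0, 0.0]  # stand-in hidden vector and one-hot label
h2, y2 = [1.0, 0.5, 0.7], [0.0, 1.0]
mixed_h, mixed_y, lam = mixup(h1, h2, y1, y2, rng)
```

    The mixed label stays a valid distribution, so it can be used directly as a soft target in a cross-entropy loss.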

  10. Data from: Efficient Deep Learning Methods for Medical Image Analysis

    • curate.nd.edu
    pdf
    Updated Nov 11, 2024
    Cite
    Yaopeng Peng (2024). Efficient Deep Learning Methods for Medical Image Analysis [Dataset]. http://doi.org/10.7274/27147567.v1
    Dataset provided by
    University of Notre Dame
    Authors
    Yaopeng Peng
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Medical image analysis plays a critical role in a range of medical applications, including diagnosis, treatment planning, and monitoring disease progression. However, it presents significant challenges due to the inherent complexity of the human body, as well as variability in image acquisition techniques, noise, and artifacts.

    Although deep learning methods have demonstrated considerable promise in medical image analysis, they frequently necessitate large volumes of annotated data for effective model training. Acquiring such annotated data can be particularly challenging in medical imaging due to factors such as the complexity of medical images and the imperative to uphold patient privacy. Furthermore, the annotation process is both time-consuming and costly, requiring the specialized expertise of medical professionals. Consequently, the limited availability of annotated data for training deep learning models often results in overfitting and suboptimal generalization to new data.

    Advances in medical image analysis have benefited from progress in foundational models originally developed for natural image domains. Innovations such as the integration of topological features into image representations and the application of Vision Transformers (ViTs) to capture global dependencies have proven valuable. However, these models often face significant challenges, including high computational costs and inference latency. Thus, there is an urgent need to develop approaches that are both data-efficient and computationally efficient to overcome these limitations. This dissertation presents six methods designed to improve segmentation and classification performance across both medical and natural scene domains. These methods include selecting the most informative slices for annotation, utilizing unlabeled slices, extracting additional topological information from existing datasets, and developing efficient Vision Transformer models to enhance performance while reducing computational costs.

    First, we employ an unsupervised method to identify the most effective and representative 2D slices from 3D calf muscle images for annotation. Subsequently, we generate pseudo-labels for all unlabeled slices and train a 3D segmentation model using both the labeled and pseudo-labeled slices. Second, we enhance the model by refining the pseudo-labels with a bi-directional hierarchical Earth Mover's Distance (bi-HEMD) algorithm and fine-tuning the segmentation results using the Primal-Dual Interior Point Method (IPM). Third, we develop a method that integrates both topological features and features extracted by a convolutional neural network (CNN) to improve performance. Fourth, we introduce a Group Vision Transformer mechanism to reduce computational complexity and model parameters, while enhancing feature diversity and reducing feature redundancy. Finally, we develop two Vision Transformer models to improve segmentation performance for detecting thin-cap fibroatheroma (TCFA) in intravascular optical coherence tomography (IVOCT) images and for skin lesion and polyp segmentation.
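
    The general pseudo-labeling idea in the first method (label a few items, predict labels for the rest, then train on both) can be sketched with a toy nearest-centroid classifier. This is a generic illustration under strong simplifications, not the dissertation's 3D segmentation pipeline: the one-dimensional features, class names, and classifier are all assumptions.

```python
def centroids(features, labels):
    """Mean feature value per class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(x, cents):
    """Label of the nearest centroid."""
    return min(cents, key=lambda y: abs(x - cents[y]))

# Toy data: a few annotated items and a larger unlabeled pool.
labeled_x, labeled_y = [0.1, 0.3, 2.9, 3.2], ["muscle", "muscle", "other", "other"]
unlabeled_x = [0.0, 0.4, 2.5, 3.6, 1.2]

# Step 1: fit on the labeled items, then pseudo-label the unlabeled ones.
cents = centroids(labeled_x, labeled_y)
pseudo_y = [predict(x, cents) for x in unlabeled_x]

# Step 2: refit on labeled plus pseudo-labeled items together.
cents = centroids(labeled_x + unlabeled_x, labeled_y + pseudo_y)
```

    In practice the pseudo-labels are noisy, which is why the abstract describes a later refinement step; this sketch omits any such refinement.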

    The performance of image recognition in both medical and natural domains can be further enhanced by developing more advanced models. Accordingly, we propose four promising future directions. First, we aim to utilize the Wavelet Transform to mitigate information loss during the down-sampling process, thereby improving detection of small objects. Second, we plan to develop a Multi-Branch Vision Transformer to capture features across various scales while reducing computational costs and inference latency. Third, we intend to create a hierarchical Hilbert Mamba framework for image recognition, which will introduce greater spatial locality and facilitate smoother transitions among image tokens. Finally, we propose to develop a semi-supervised model for medical image segmentation, based on the Segment Anything Model, to address challenges associated with sparse annotations.

  11. Classification of gravure printed patterns using convolutional neural...

    • tudatalib.ulb.tu-darmstadt.de
    Updated 2023
    Cite
    Rothmann-Brumm, Pauline (2023). Classification of gravure printed patterns using convolutional neural networks (Python code) [Dataset]. http://doi.org/10.48328/tudatalib-1147
    Authors
    Rothmann-Brumm, Pauline
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset contains Python code ('code_DeepLearn_ImgClass.zip') for automated classification of gravure printed patterns from the HYPA-p dataset. The developed algorithm performs supervised deep learning of convolutional neural networks (CNNs) on labeled data ('CNN_dataset.zip'), i.e. selected, labeled 'S-subfields' from the HYPA-p dataset. 'CNN_dataset.zip' is a subset from the images in the folder 'labeled_data.zip', which can be created with the provided Python code. PyTorch is used as a deep learning framework. The Python code yields trained CNNs, which can be used for automated classification of unlabeled data from the HYPA-p dataset. Well-known, pre-trained network architectures like Densenet-161 or MobileNetV2 are used as a starting point for training. Several trained CNNs are included in this submission, see 'trained_CNN_models.zip'. Further information can be found in the dissertation of Pauline Rothmann-Brumm (2023) and in the provided README-file.

  12. Data from: Amos: A large-scale abdominal multi-organ benchmark for versatile...

    • zenodo.org
    • explore.openaire.eu
    csv, zip
    Updated May 25, 2023
    Cite
    JI YUANFENG (2023). Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation [Dataset]. http://doi.org/10.5281/zenodo.7262581
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    JI YUANFENG
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most of the deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate the limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and a test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of the existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. The paper can be found at https://arxiv.org/pdf/2206.08023.pdf

    In addition to providing the 600 labeled CT and MRI scans, we expect to provide 2000 CT and 1200 MRI scans without labels to support more learning tasks (semi-supervised, unsupervised, domain adaptation, ...). The link can be found in:

    If you find this dataset useful for your research, please cite:

    @inproceedings{NEURIPS2022_ee604e1b,
     author = {Ji, Yuanfeng and Bai, Haotian and GE, Chongjian and Yang, Jie and Zhu, Ye and Zhang, Ruimao and Li, Zhen and Zhanng, Lingyan and Ma, Wanling and Wan, Xiang and Luo, Ping},
     booktitle = {Advances in Neural Information Processing Systems},
     editor = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
     pages = {36722--36732},
     publisher = {Curran Associates, Inc.},
     title = {AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation},
     url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/ee604e1bedbd069d9fc9328b7b9584be-Paper-Datasets_and_Benchmarks.pdf},
     volume = {35},
     year = {2022}
    }
    

  13. Bean Plant Pathologies Dataset for Deep Learning Tasks

    • dataverse.harvard.edu
    • dataone.org
    Updated May 3, 2024
    Cite
    Marconi Lab (2024). Bean Plant Pathologies Dataset for Deep Learning Tasks [Dataset]. http://doi.org/10.7910/DVN/WFSLBY
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 3, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Marconi Lab
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset is part of the Makerere University Beans Image Dataset, designed to diagnose bean crop diseases and conduct spatial analysis, available at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TCKVEW. It includes images categorized into four classes: healthy bean leaves, Angular Leaf Spot (ALS) in bean leaves, Bean Rust in bean leaves, and an additional 'unknown' class. These images help models classify and detect the target bean leaf classes and distinguish them from other visuals. The dataset was used for a project that leverages edge computing and deep learning for the real-time identification of bean plant pathologies. The dataset is organized into two main folders, each serving a specific purpose. The first folder contains data for the classification task, with images distributed among the four classes: healthy, ALS, Bean Rust, and the unknown class. The second folder is dedicated to the detection task, featuring annotations for ALS and Bean Rust, as well as unlabeled healthy images to enhance the learning of detection models.

  14. UVP5 data sorted with EcoTaxa and MorphoCluster

    • seanoe.org
    image/*
    Updated 2020
    + more versions
    Cite
    Rainer Kiko; Simon-Martin Schröder (2020). UVP5 data sorted with EcoTaxa and MorphoCluster [Dataset]. http://doi.org/10.17882/73002
    Explore at:
    Available download formats: image/*
    Dataset updated
    2020
    Dataset provided by
    SEANOE
    Authors
    Rainer Kiko; Simon-Martin Schröder
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Oct 23, 2012 - Aug 7, 2017
    Area covered
    Description

    Here, we provide plankton image data that was sorted with the web applications EcoTaxa and MorphoCluster. The data set was used for image classification tasks as described in Schröder et al. (in preparation) and does not include any geospatial or temporal metadata.

    Plankton was imaged using the Underwater Vision Profiler 5 (Picheral et al. 2010) in various regions of the world's oceans between 2012-10-24 and 2017-08-08.

    This data publication consists of an archive containing "training.csv" (list of 392k training images for classification, validated using EcoTaxa), "validation.csv" (list of 196k validation images for classification, validated using EcoTaxa), "unlabeld.csv" (list of 1M unlabeled images), "morphocluster.csv" (1.2M objects validated using MorphoCluster, a subset of "unlabeled.csv" and "validation.csv") and the image files themselves. The CSV files each contain the columns "object_id" (a unique ID), "image_fn" (the relative filename), and "label" (the assigned name).

    The training and validation sets were sorted into 65 classes using the web application EcoTaxa (http://ecotaxa.obs-vlfr.fr). This data shows a severe class imbalance; the 10% most populated classes contain more than 80% of the objects and the class sizes span four orders of magnitude. The validation set and a set of additional 1M unlabeled images were sorted during the first trial of MorphoCluster (https://github.com/morphocluster).

    The images in this data set were sampled during RV Meteor cruises M92, M93, M96, M97, M98, M105, M106, M107, M108, M116, M119, M121, M130, M131, M135, M136, M137 and M138, during RV Maria S. Merian cruises MSM22, MSM23, MSM40 and MSM49, during the RV Polarstern cruise PS88b and during the FLUXES1 experiment with RV Sarmiento de Gamboa.

    The following people have contributed to the sorting of the image data on EcoTaxa: Rainer Kiko, Tristan Biard, Benjamin Blanc, Svenja Christiansen, Justine Courboules, Charlotte Eich, Jannik Faustmann, Christine Gawinski, Augustin Lafond, Aakash Panchal, Marc Picheral, Akanksha Singh and Helena Hauss.

    In Schröder et al. (in preparation), the training set serves as a source for knowledge transfer in the training of the feature extractor. The classification using MorphoCluster was conducted by Rainer Kiko. Used labels are operational and not yet matched to respective EcoTaxa classes.

  15. Data from: SemEval-2021 Task 10: Source-Free Domain Adaptation for Semantic...

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Jul 28, 2021
    Cite
    Egoitz Laparra; Xin Su; Yiyun Zhao; Özlem Uzuner; Timothy A. Miller; Steven Bethard (2021). SemEval-2021 Task 10: Source-Free Domain Adaptation for Semantic Processing [Dataset]. http://doi.org/10.5281/zenodo.5132956
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Jul 28, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Egoitz Laparra; Xin Su; Yiyun Zhao; Özlem Uzuner; Timothy A. Miller; Steven Bethard
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data sharing restrictions are common in NLP datasets. For example, Twitter policies do not allow sharing of tweet text, though tweet IDs may be shared. The situation is even more common in clinical NLP, where patient health information must be protected, and annotations over health text, when released at all, often require the signing of complex data use agreements. The SemEval-2021 Task 10 framework asks participants to develop semantic annotation systems in the face of data sharing constraints. A participant's goal is to develop an accurate system for a target domain when annotations exist for a related domain but cannot be distributed. Instead of annotated training data, participants are given a model trained on the annotations. Then, given unlabeled target domain data, they are asked to make predictions.

    Website: https://machine-learning-for-medical-language.github.io/source-free-domain-adaptation/

    CodaLab site: https://competitions.codalab.org/competitions/26152

    Github repository: https://github.com/Machine-Learning-for-Medical-Language/source-free-domain-adaptation

  16. DataSheet_1_HiRAND: A novel GCN semi-supervised deep learning-based...

    • frontiersin.figshare.com
    pdf
    Updated Jun 21, 2023
    + more versions
    Cite
    Yue Huang; Zhiwei Rong; Liuchao Zhang; Zhenyi Xu; Jianxin Ji; Jia He; Weisha Liu; Yan Hou; Kang Li (2023). DataSheet_1_HiRAND: A novel GCN semi-supervised deep learning-based framework for classification and feature selection in drug research and development.pdf [Dataset]. http://doi.org/10.3389/fonc.2023.1047556.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    Frontiers
    Authors
    Yue Huang; Zhiwei Rong; Liuchao Zhang; Zhenyi Xu; Jianxin Ji; Jia He; Weisha Liu; Yan Hou; Kang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The prediction of response to drugs before initiating therapy based on transcriptome data is a major challenge. However, identifying effective drug response label data costs time and resources. Methods available often predict poorly and fail to identify robust biomarkers due to the curse of dimensionality: high dimensionality and low sample size. Therefore, this necessitates the development of predictive models to effectively predict the response to drugs using limited labeled data while being interpretable. In this study, we report a novel Hierarchical Graph Random Neural Networks (HiRAND) framework to predict the drug response using transcriptome data of few labeled data and additional unlabeled data. HiRAND completes the information integration of the gene graph and sample graph by graph convolutional network (GCN). The innovation of our model is leveraging data augmentation strategy to solve the dilemma of limited labeled data and using consistency regularization to optimize the prediction consistency of unlabeled data across different data augmentations. The results showed that HiRAND achieved better performance than competitive methods in various prediction scenarios, including both simulation data and multiple drug response data. We found that the prediction ability of HiRAND in the drug vorinostat showed the best results across all 62 drugs. In addition, HiRAND was interpreted to identify the key genes most important to vorinostat response, highlighting critical roles for ribosomal protein-related genes in the response to histone deacetylase inhibition. Our HiRAND could be utilized as an efficient framework for improving the drug response prediction performance using few labeled data.

  17. Data from: Learning protein fitness models from evolutionary and...

    • zenodo.org
    • datadryad.org
    application/gzip
    Updated Jun 5, 2022
    Cite
    Chloe Hsu; Hunter Nisonoff; Clara Fannjiang; Jennifer Listgarten (2022). Learning protein fitness models from evolutionary and assay-labeled data [Dataset]. http://doi.org/10.6078/d1k71b
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jun 5, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Chloe Hsu; Hunter Nisonoff; Clara Fannjiang; Jennifer Listgarten
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily-related sequences, or variant sequences with experimentally measured labels. For regimes where only limited experimental data are available, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with one density feature from modelling the evolutionary data. Within this approach, we find that a variational autoencoder-based density model showed the best overall performance, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluations and sufficient baselines.

  18. Data from: The COUGHVID crowdsourcing dataset: A corpus for the study of...

    • zenodo.org
    • ekoizpen-zientifikoa.ehu.eus
    • +1 more
    zip
    Updated Aug 26, 2022
    + more versions
    Cite
    Lara Orlandic; Tomas Teijeiro; David Atienza (2022). The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms [Dataset]. http://doi.org/10.5281/zenodo.4498364
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 26, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lara Orlandic; Tomas Teijeiro; David Atienza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    Cough audio signal classification has been successfully used to diagnose a variety of respiratory conditions, and there has been significant interest in leveraging Machine Learning (ML) to provide widespread COVID-19 screening. The COUGHVID dataset provides over 20,000 crowdsourced cough recordings representing a wide range of subject ages, genders, geographic locations, and COVID-19 statuses. Furthermore, experienced pulmonologists labeled more than 2,000 recordings to diagnose medical abnormalities present in the coughs, thereby contributing one of the largest expert-labeled cough datasets in existence that can be used for a plethora of cough audio classification tasks. As a result, the COUGHVID dataset contributes a wealth of cough recordings for training ML models to address the world’s most urgent health crises.

    Private Set and Testing Protocol

    Researchers interested in testing their models on the private test dataset should contact us at coughvid@epfl.ch, briefly explaining the type of validation they wish to make and the results they obtained through cross-validation with the public data. Access to the unlabeled recordings will then be provided, and the researchers should send the predictions of their models on these recordings. Finally, the performance metrics of the predictions will be sent to the researchers. The private testing data is not included in any file within our Zenodo record, and it can only be accessed by contacting the COUGHVID team at the aforementioned e-mail address.

  19. Table_4_sscNOVA: a semi-supervised convolutional neural network for...

    • figshare.com
    xlsx
    Updated Feb 6, 2024
    Cite
    Haibo Li; Zhenhua Yu; Fang Du; Lijuan Song; Yang Gao; Fangyuan Shi (2024). Table_4_sscNOVA: a semi-supervised convolutional neural network for predicting functional regulatory variants in autoimmune diseases.xlsx [Dataset]. http://doi.org/10.3389/fimmu.2024.1323072.s005
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 6, 2024
    Dataset provided by
    Frontiers
    Authors
    Haibo Li; Zhenhua Yu; Fang Du; Lijuan Song; Yang Gao; Fangyuan Shi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Genome-wide association studies (GWAS) have identified thousands of variants in the human genome associated with autoimmune diseases. However, identifying functional regulatory variants associated with autoimmune diseases remains challenging, largely because of insufficient experimental validation data. We adopt the concept of semi-supervised learning by combining labeled and unlabeled data to develop a deep learning-based algorithm framework, sscNOVA, to predict functional regulatory variants in autoimmune diseases and analyze the functional characteristics of these regulatory variants. Compared to traditional supervised learning methods, our approach leverages more variants’ data to explore the relationship between functional regulatory variants and autoimmune diseases. Based on the experimentally curated testing dataset and evaluation metrics, we find that sscNOVA outperforms other state-of-the-art methods. Furthermore, we illustrate that sscNOVA can help to improve the prioritization of functional regulatory variants from lead single-nucleotide polymorphisms and the proxy variants in autoimmune GWAS data.

  20. Unlabelled Weed Detection Images for Hot Peppers

    • data.amerigeoss.org
    Updated Nov 1, 2022
    Cite
    Trinidad and Tobago (2022). Unlabelled Weed Detection Images for Hot Peppers [Dataset]. https://data.amerigeoss.org/dataset/weeddetection_hotpeppers
    Explore at:
    Dataset updated
    Nov 1, 2022
    Dataset provided by
    Trinidad and Tobago
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data contains images of Capsicum annuum grown on several smallholder farms in Trinidad and Tobago, showing different levels of weed cover and different weed species. In most instances, weeds can be recognized by the naked eye. However, there are times when the weeds and the crops are of similar species and may appear almost identical. When weeds are plentiful and interwoven with crops, it becomes increasingly difficult to determine weed cover on a given piece of land. This data can be used in research surrounding weed detection in hot peppers. When accompanied by the labelled versions, this data can be used to train machine learning models for weed detection in Capsicum annuum (hot peppers).

Cite
Juan Sebastián Cañas; Toro-Gómez María Paula; Moreira Sugai Larissa Sayuri; Luis Felipe Toledo; De Souza Franco Leandro; Neckel De Oliveira Selvino; Pereira Bastos Rogerio; Llusia Diego; Ulloa Juan Sebastián (2024). Unlabeled AnuraSet: A dataset for leveraging unlabeled data in machine learning models for passive acoustic monitoring [Dataset]. http://doi.org/10.5281/zenodo.11244814

Unlabeled AnuraSet: A dataset for leveraging unlabeled data in machine learning models for passive acoustic monitoring

Explore at:
Available download formats: zip, txt
Dataset updated
May 27, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Juan Sebastián Cañas; Toro-Gómez María Paula; Moreira Sugai Larissa Sayuri; Luis Felipe Toledo; De Souza Franco Leandro; Neckel De Oliveira Selvino; Pereira Bastos Rogerio; Llusia Diego; Ulloa Juan Sebastián
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The Unlabeled AnuraSet (U-AnuraSet) is an extension of the original AnuraSet dataset. It consists of soundscape recordings from passive acoustic monitoring conducted in Brazil. The recording sites are identical to those in the original AnuraSet. Each site comprises 2,666 one-minute raw audio files of unlabeled data. The U-AnuraSet is publicly available to encourage machine learning researchers to explore innovative methods for leveraging unlabeled data in the training of models aimed at solving problems such as anuran call identification.

If you find the Unlabeled AnuraSet useful for your research, please consider citing it as follows:

Cañas, J.S., Toro-Gómez, M.P., Sugai, L.S.M., et al. A dataset for benchmarking Neotropical anuran calls identification in passive acoustic monitoring. Sci Data 10, 771 (2023). https://doi.org/10.1038/s41597-023-02666-2
