41 datasets found

Unlabeled AnuraSet: A dataset for leveraging unlabeled data in machine...
zenodo.org
data.niaid.nih.gov
txt, zip
Updated May 27, 2024
Cite
Juan Sebastián Cañas; Toro-Gómez María Paula; Moreira Sugai Larissa Sayuri; Luis Felipe Toledo; De Souza Franco Leandro; Neckel De Oliveira Selvino; Pereira Bastos Rogerio; Llusia Diego; Ulloa Juan Sebastián (2024). Unlabeled AnuraSet: A dataset for leveraging unlabeled data in machine learning models for passive acoustic monitoring [Dataset]. http://doi.org/10.5281/zenodo.11244814
Available download formats: zip, txt
Unique identifier
https://doi.org/10.5281/zenodo.11244814
Dataset updated
May 27, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Juan Sebastián Cañas; Toro-Gómez María Paula; Moreira Sugai Larissa Sayuri; Luis Felipe Toledo; De Souza Franco Leandro; Neckel De Oliveira Selvino; Pereira Bastos Rogerio; Llusia Diego; Ulloa Juan Sebastián
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Unlabeled AnuraSet (U-AnuraSet) is an extension of the original AnuraSet dataset. It consists of soundscape recordings from passive acoustic monitoring conducted in Brazil. The recording sites are identical to those in the original AnuraSet. Each site comprises 2,666 one-minute raw audio files of unlabeled data. The U-AnuraSet is publicly available to encourage machine learning researchers to explore innovative methods for leveraging unlabeled data in the training of models aimed at solving problems such as anuran call identification.

If you find the Unlabeled AnuraSet useful for your research, please consider citing it as follows:

Cañas, J.S., Toro-Gómez, M.P., Sugai, L.S.M., et al. A dataset for benchmarking Neotropical anuran calls identification in passive acoustic monitoring. Sci Data 10, 771 (2023). https://doi.org/10.1038/s41597-023-02666-2
Explanations for each cluster in Iris dataset.
plos.figshare.com
xls
Updated Oct 27, 2023
Cite
Liang Chen; Caiming Zhong; Zehua Zhang (2023). Explanations for each cluster in Iris dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0292960.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0292960.t003
Dataset updated
Oct 27, 2023
Dataset provided by
PLOS ONE
Authors
Liang Chen; Caiming Zhong; Zehua Zhang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Clustering is an unsupervised machine learning technique whose goal is to group unlabeled data. However, traditional clustering methods only output a set of results and do not provide any explanations of them. Although a number of methods based on decision trees have been proposed in the literature to explain clustering results, most of them have disadvantages, such as too many branches and overly deep leaves, which lead to complex explanations that are difficult for users to understand. In this paper, a hypercube overlay model based on multi-objective optimization is proposed to achieve succinct explanations of clustering results. The model designs two objective functions based on the number of hypercubes and the compactness of instances and then uses multi-objective optimization to find a set of nondominated solutions. Finally, a utopia point is defined to determine the most suitable solution, in which each cluster can be covered by as few hypercubes as possible. Based on these hypercubes, an explanation of each cluster is provided. Verification on synthetic and real datasets shows that the model can provide concise and understandable explanations to users.
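The selection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' model: the candidate objective values are invented, both objectives are assumed to be minimized, and picking the nondominated solution closest to the utopia point is an assumption about how "most suitable" is decided.

```python
import math

def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated(points):
    """Keep only the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def closest_to_utopia(front):
    """Pick the front member nearest to the utopia point (per-objective minimum)."""
    utopia = tuple(min(p[i] for p in front) for i in range(len(front[0])))
    return min(front, key=lambda p: math.dist(p, utopia))

# Hypothetical candidates: (number of hypercubes, compactness cost).
candidates = [(1, 9.0), (2, 4.0), (3, 3.5), (4, 3.4), (3, 5.0), (5, 3.4)]
front = nondominated(candidates)
best = closest_to_utopia(front)
print(front, best)
```

In practice the objectives would usually be normalized before measuring the distance, since they are on different scales; that step is left out here to keep the sketch short.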
Mal-Activity
opendatalab.com
paperswithcode.com
zip
Updated Mar 17, 2023
Cite
University of New South Wales (2023). Mal-Activity [Dataset]. https://opendatalab.com/OpenDataLab/Mal-Activity
Available download formats: zip
Dataset updated
Mar 17, 2023
Dataset provided by
Nokia Bell Labs
University of New South Wales
Macquarie University
University of Sydney
Description
This is a dataset of Internet malicious activity (mal-activity for short). It contains more than 51 million mal-activity reports involving 662K unique IP addresses covering the period from January 2007 to June 2017. Leveraging the Wayback Machine, antivirus (AV) tool reports, and several additional public datasets (e.g., BGP Route Views and Internet registries), the data is enriched with historical meta-information including geo-locations (countries), autonomous system (AS) numbers, and types of mal-activity. An initially labelled dataset of approximately 1.57 million mal-activities (obtained from public blacklists) is used to train a machine learning classifier to classify the remaining unlabeled dataset of approximately 44 million mal-activities obtained through additional sources.
Setting of parameters.
plos.figshare.com
xls
Updated Oct 27, 2023
+ more versions
Cite
Liang Chen; Caiming Zhong; Zehua Zhang (2023). Setting of parameters. [Dataset]. http://doi.org/10.1371/journal.pone.0292960.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0292960.t002
Dataset updated
Oct 27, 2023
Dataset provided by
PLOS ONE
Authors
Liang Chen; Caiming Zhong; Zehua Zhang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Clustering is an unsupervised machine learning technique whose goal is to group unlabeled data. However, traditional clustering methods only output a set of results and do not provide any explanations of them. Although a number of methods based on decision trees have been proposed in the literature to explain clustering results, most of them have disadvantages, such as too many branches and overly deep leaves, which lead to complex explanations that are difficult for users to understand. In this paper, a hypercube overlay model based on multi-objective optimization is proposed to achieve succinct explanations of clustering results. The model designs two objective functions based on the number of hypercubes and the compactness of instances and then uses multi-objective optimization to find a set of nondominated solutions. Finally, a utopia point is defined to determine the most suitable solution, in which each cluster can be covered by as few hypercubes as possible. Based on these hypercubes, an explanation of each cluster is provided. Verification on synthetic and real datasets shows that the model can provide concise and understandable explanations to users.
UCI and OpenML Data Sets for Ordinal Quantification
zenodo.org
zip
Updated Jul 25, 2023
Cite
Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.8177302
Dataset updated
Jul 25, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
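As a rough illustration of the APP idea, the sketch below draws a label distribution uniformly from the probability simplex and then samples item indices to match it. It is not the authors' extraction code: the function names, the sample size, and sampling with replacement are all assumptions made for this example.

```python
import random

def sample_prevalences(n_classes, rng):
    """Draw one label distribution uniformly from the probability simplex.

    Normalizing independent Exponential(1) draws yields a flat Dirichlet sample.
    """
    draws = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]

def draw_app_sample(labels, n_items, rng):
    """Return (indices, prevalences) for one sample drawn with replacement."""
    classes = sorted(set(labels))
    by_class = {c: [i for i, y in enumerate(labels) if y == c] for c in classes}
    prevalences = sample_prevalences(len(classes), rng)
    # Pick a class for every slot according to the drawn prevalences,
    # then pick a random item of that class.
    picked = rng.choices(classes, weights=prevalences, k=n_items)
    return [rng.choice(by_class[c]) for c in picked], prevalences

rng = random.Random(0)
labels = [1, 2, 3, 4, 5] * 20          # hypothetical ordinal labels
indices, prevalences = draw_app_sample(labels, 50, rng)
print([round(p, 3) for p in prevalences])
```

To replicate the published samples exactly, the index files listed above should be used instead of re-drawing.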

Usage

You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

Data Extraction: In your terminal, you can call either

make

(recommended), or

julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl

Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
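A minimal sketch of reading such a file in Python and computing the class distribution follows. Only the "class_label" column name comes from the description; the in-memory CSV content, the other column names, and the assumption that labels are integers are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for one extracted CSV file.
example_csv = io.StringIO(
    "class_label,feature_1,feature_2\n"
    "1,0.5,1.2\n"
    "3,0.1,0.7\n"
    "1,0.9,0.3\n"
    "2,0.4,0.8\n"
)

def class_prevalences(handle):
    """Return the relative frequency of each ordinal class in a CSV file."""
    reader = csv.DictReader(handle)  # the first row is used as the header
    counts = Counter(int(row["class_label"]) for row in reader)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}

prevalences = class_prevalences(example_csv)
print(prevalences)
```

For a real file, the `io.StringIO` object would be replaced by `open(path, newline="")`.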

Further Reading

Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
Data from: Exploring deep learning techniques for wild animal behaviour...
data.niaid.nih.gov
zenodo.org
+1 more
zip
Updated Feb 22, 2024
Cite
Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa (2024). Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. http://doi.org/10.5061/dryad.2ngf1vhwk
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.2ngf1vhwk
Dataset updated
Feb 22, 2024
Dataset provided by
Osaka University
Nagoya University
Authors
Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa
License
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Description
Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. 
Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.

This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see README for the details of the datasets.
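The random augmentation described in the abstract above (one technique chosen at random per window during mini-batch training) can be sketched as follows. This is not the authors' implementation: the parameter ranges, the window layout (a list of [x, y, z] samples), and all function names are assumptions, and time-warping and rotation are omitted for brevity.

```python
import random

def identity(window, rng):
    """The 'none' option: return the window unchanged."""
    return window

def scale(window, rng):
    """Multiply every value by one random factor (assumed range 0.8 to 1.2)."""
    factor = rng.uniform(0.8, 1.2)
    return [[v * factor for v in sample] for sample in window]

def jitter(window, rng):
    """Add small Gaussian noise to every value (assumed sigma 0.05)."""
    return [[v + rng.gauss(0.0, 0.05) for v in sample] for sample in window]

def permute(window, rng, n_segments=4):
    """Cut the window into segments and shuffle their order."""
    size = len(window) // n_segments
    segments = [window[i * size:(i + 1) * size] for i in range(n_segments - 1)]
    segments.append(window[(n_segments - 1) * size:])
    rng.shuffle(segments)
    return [sample for segment in segments for sample in segment]

AUGMENTATIONS = [identity, scale, jitter, permute]

def augment_batch(batch, rng):
    """Apply one randomly chosen augmentation to each window in the batch."""
    return [rng.choice(AUGMENTATIONS)(window, rng) for window in batch]

rng = random.Random(0)
# Hypothetical mini-batch: 8 windows of 20 tri-axial acceleration samples.
batch = [[[rng.gauss(0, 1) for _ in range(3)] for _ in range(20)] for _ in range(8)]
augmented = augment_batch(batch, rng)
print(len(augmented), len(augmented[0]), len(augmented[0][0]))
```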
Data and code from: Learning a deep language model for microbiomes: The...
search.dataone.org
data.niaid.nih.gov
+2 more
Updated Feb 26, 2025
Cite
Quintin Pope; Rohan Varma; Christine Tataru; Maude David; Xiaoli Fern (2025). Data and code from: Learning a deep language model for microbiomes: The power of large scale unlabeled microbiome data [Dataset]. https://search.dataone.org/view/sha256%3A8c00b06c01187eb7b0df45066ab9152b765faa96c82b52a4ce6e18de1632a2b9
Dataset updated
Feb 26, 2025
Dataset provided by
Dryad Digital Repository
Authors
Quintin Pope; Rohan Varma; Christine Tataru; Maude David; Xiaoli Fern
Description
We use open source human gut microbiome data to learn a microbial “language” model by adapting techniques from Natural Language Processing (NLP). Our microbial “language” model is trained in a self-supervised fashion (i.e., without additional external labels) to capture the interactions among different microbial taxa and the common compositional patterns in microbial communities. The learned model produces contextualized taxon representations that allow a single microbial taxon to be represented differently according to the specific microbial environment in which it appears. The model further provides a sample representation by collectively interpreting different microbial taxa in the sample and their interactions as a whole. We demonstrate that, while our sample representation performs comparably to baseline models in in-domain prediction tasks such as predicting Inflammatory Bowel Disease (IBD) and diet patterns, it significantly outperforms them when generalizing to test data from indep...

No additional raw data was collected for this project. All inputs are available publicly. American Gut Project, Halfvarson, and Schirmer raw data are available from the NCBI database (accession numbers PRJEB11419, PRJEB18471, and PRJNA398089, respectively). We used the curated data produced by Tataru and David, 2020.

Code and data for "Learning a deep language model for microbiomes: the power of large scale unlabeled microbiome data"

Data:

vocab_embeddings.npy

Fixed vocabulary embeddings produced from prior work: Decoding the language of microbiomes using word-embedding techniques, and applications in inflammatory bowel disease. Adapted from here.

microbiomedata.zip

Contains the labels and data for the three datasets used in this study. Specifically, it includes:

IBD_(test|train)_(512|otu).npy and IBD_(test|train)_labels.npy

halfvarson_(512_otu|otu).npy and halfvarson_IBD_labels.npy

schirmer_IBD_(512_otu|otu).npy and schirmer_IBD_labels.npy

(test|train)encodings_(512|1897).npy

The data are stored as n_samples x max_sample_size x 2 numpy arrays, containing both the vocab IDs of the taxa in the ...
Product Reviews for Ordinal Quantification
zenodo.org
zip
Updated Oct 4, 2023
Cite
Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). Product Reviews for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.7081208
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7081208
Dataset updated
Oct 4, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
Description
This data set comprises a labeled training set, validation samples, and testing samples for ordinal quantification. It appears in our research paper "Ordinal Quantification Through Regularization", which we have published at ECML-PKDD 2022.

The data is extracted from the McAuley data set of product reviews in Amazon, where the goal is to predict the 5-star rating of each textual review. We have sampled this data according to two protocols that are suited for quantification research. The goal of quantification is not to predict the star rating of each individual instance, but the distribution of ratings in sets of textual reviews. More generally speaking, quantification aims at estimating the distribution of labels in unlabeled samples of data.

The first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification, where classes are ordered and a similarity of neighboring classes can be assumed. 5-star ratings of product reviews lie on an ordinal scale and, hence, pose such an ordinal quantification task.
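The APP-OQ filtering step can be sketched as below. The description does not say how smoothness is measured, so the sum of squared second-order differences between neighboring classes is used here purely as an assumed stand-in; the function names and the number of samples are hypothetical as well.

```python
import random

def random_prevalences(n_classes, rng):
    """Draw a label distribution uniformly from the probability simplex (APP)."""
    draws = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]

def jaggedness(p):
    """Assumed smoothness measure: squared second differences, lower is smoother."""
    return sum((p[i - 1] - 2 * p[i] + p[i + 1]) ** 2 for i in range(1, len(p) - 1))

def smoothest_fraction(samples, fraction=0.2):
    """Keep the given fraction of samples with the lowest jaggedness."""
    keep = max(1, round(len(samples) * fraction))
    return sorted(samples, key=jaggedness)[:keep]

rng = random.Random(0)
app_samples = [random_prevalences(5, rng) for _ in range(100)]   # 5-star scale
app_oq_samples = smoothest_fraction(app_samples)
print(len(app_oq_samples))
```

A perfectly uniform or linearly increasing distribution has zero jaggedness under this measure, while a distribution that alternates between high and low neighboring classes scores high and is filtered out.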

This data set comprises two representations of the McAuley data. The first representation consists of TF-IDF features. The second representation is a RoBERTa embedding. This second representation is dense, while the first is sparse. In our experience, logistic regression classifiers work well with both representations. RoBERTa embeddings yield more accurate predictors than the TF-IDF features.

You can extract our data sets yourself, for instance, if you require a raw textual representation. The original McAuley data set is public already and we provide all of our extraction scripts.

Extraction scripts and experiments: https://github.com/mirkobunse/ecml22

Original data by McAuley: https://jmcauley.ucsd.edu/data/amazon/
Data from: Exploring deep learning techniques for wild animal behaviour...
data.niaid.nih.gov
Updated Jan 23, 2024
+ more versions
Cite
Maekawa, Takuya (2024). Data from: Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10557258
Dataset updated
Jan 23, 2024
Dataset provided by
Otsuka, Ryoma
Mizutani, Yuichi
Maekawa, Takuya
Tanigaki, Kei
Koyama, Shiho
Yoda, Ken
Yoshimura, Naoya
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
1: Machine learning-based behaviour classification using acceleration data is a powerful tool in bio-logging research. Deep learning architectures such as convolutional neural networks (CNN), long short-term memory (LSTM), and self-attention mechanism as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration-based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached, and complexity in data due to complex animal-specific behaviours, which may have limited the application of deep learning techniques in this area.

2: To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup, and pre-training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state-of-the-art deep learning model architectures.

3: Data augmentation improved the overall model performance when one of various techniques (none, scaling, jittering, permutation, time-warping, and rotation) was randomly applied to each data during mini-batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre-training with unlabelled data did not improve model performance. The state-of-the-art deep learning models, including a model consisting of four CNN layers, an LSTM layer, and a multi-head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features.

4: Our experiments showed that deep learning techniques are promising for acceleration-based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning, and self-supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time-series sensor data.

This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see README for the details of the datasets.
Data from: Efficient Deep Learning Methods for Medical Image Analysis
curate.nd.edu
pdf
Updated Nov 11, 2024
Cite
Yaopeng Peng (2024). Efficient Deep Learning Methods for Medical Image Analysis [Dataset]. http://doi.org/10.7274/27147567.v1
Available download formats: pdf
Unique identifier
https://doi.org/10.7274/27147567.v1
Dataset updated
Nov 11, 2024
Dataset provided by
University of Notre Dame
Authors
Yaopeng Peng
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Medical image analysis plays a critical role in a range of medical applications, including diagnosis, treatment planning, and monitoring disease progression. However, it presents significant challenges due to the inherent complexity of the human body, as well as variability in image acquisition techniques, noise, and artifacts.

Although deep learning methods have demonstrated considerable promise in medical image analysis, they frequently necessitate large volumes of annotated data for effective model training. Acquiring such annotated data can be particularly challenging in medical imaging due to factors such as the complexity of medical images and the imperative to uphold patient privacy. Furthermore, the annotation process is both time-consuming and costly, requiring the specialized expertise of medical professionals. Consequently, the limited availability of annotated data for training deep learning models often results in overfitting and suboptimal generalization to new data.

Advances in medical image analysis have benefited from progress in foundational models originally developed for natural image domains. Innovations such as the integration of topological features into image representations and the application of Vision Transformers (ViTs) to capture global dependencies have proven valuable. However, these models often face significant challenges, including high computational costs and inference latency. Thus, there is an urgent need to develop approaches that are both data-efficient and computationally efficient to overcome these limitations. This dissertation presents six methods designed to improve segmentation and classification performance across both medical and natural scene domains. These methods include selecting the most informative slices for annotation, utilizing unlabeled slices, extracting additional topological information from existing datasets, and developing efficient Vision Transformer models to enhance performance while reducing computational costs.

First, we employ an unsupervised method to identify the most effective and representative 2D slices from 3D calf muscle images for annotation. Subsequently, we generate pseudo-labels for all unlabeled slices and train a 3D segmentation model using both the labeled and pseudo-labeled slices. Second, we enhance the model by refining the pseudo-labels with a bi-directional hierarchical Earth Mover's Distance (bi-HEMD) algorithm and fine-tuning the segmentation results using the Primal-Dual Interior Point Method (IPM). Third, we develop a method that integrates both topological features and features extracted by a convolutional neural network (CNN) to improve performance. Fourth, we introduce a Group Vision Transformer mechanism to reduce computational complexity and model parameters, while enhancing feature diversity and reducing feature redundancy. Finally, we develop two Vision Transformer models to improve segmentation performance for detecting thin-cap fibroatheroma (TCFA) in intravascular optical coherence tomography (IVOCT) images and for skin lesion and polyp segmentation.

The performance of image recognition in both medical and natural domains can be further enhanced by developing more advanced models. Accordingly, we propose four promising future directions. First, we aim to utilize the Wavelet Transform to mitigate information loss during the down-sampling process, thereby improving detection of small objects. Second, we plan to develop a Multi-Branch Vision Transformer to capture features across various scales while reducing computational costs and inference latency. Third, we intend to create a hierarchical Hilbert Mamba framework for image recognition, which will introduce greater spatial locality and facilitate smoother transitions among image tokens. Finally, we propose to develop a semi-supervised model for medical image segmentation, based on the Segment Anything Model, to address challenges associated with sparse annotations.
Classification of gravure printed patterns using convolutional neural...
tudatalib.ulb.tu-darmstadt.de
Updated 2023
+ more versions
Cite
Rothmann-Brumm, Pauline (2023). Classification of gravure printed patterns using convolutional neural networks (Python code) [Dataset]. http://doi.org/10.48328/tudatalib-1147
Unique identifier
https://doi.org/10.48328/tudatalib-1147
Dataset updated
2023
Authors
Rothmann-Brumm, Pauline
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This dataset contains Python code ('code_DeepLearn_ImgClass.zip') for automated classification of gravure printed patterns from the HYPA-p dataset. The developed algorithm performs supervised deep learning of convolutional neural networks (CNNs) on labeled data ('CNN_dataset.zip'), i.e. selected, labeled 'S-subfields' from the HYPA-p dataset. 'CNN_dataset.zip' is a subset from the images in the folder 'labeled_data.zip', which can be created with the provided Python code. PyTorch is used as a deep learning framework. The Python code yields trained CNNs, which can be used for automated classification of unlabeled data from the HYPA-p dataset. Well-known, pre-trained network architectures like Densenet-161 or MobileNetV2 are used as a starting point for training. Several trained CNNs are included in this submission, see 'trained_CNN_models.zip'. Further information can be found in the dissertation of Pauline Rothmann-Brumm (2023) and in the provided README-file.
Data from: Amos: A large-scale abdominal multi-organ benchmark for versatile...
zenodo.org
explore.openaire.eu
csv, zip
Updated May 25, 2023
+ more versions
Cite
JI YUANFENG (2023). Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation [Dataset]. http://doi.org/10.5281/zenodo.7262581
Available download formats: csv, zip
Unique identifier
https://doi.org/10.5281/zenodo.7262581
Dataset updated
May 25, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
JI YUANFENG
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most of the deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate these limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and a test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of the existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. The paper can be found at https://arxiv.org/pdf/2206.08023.pdf

In addition to providing the 600 labeled CT and MRI scans, we expect to provide 2000 CT and 1200 MRI scans without labels to support more learning tasks (semi-supervised, unsupervised, domain adaptation, ...). The links can be found in:

labeled data (500CT+100MRI)

unlabeled data Part I (900CT)

unlabeled data Part II (1100CT) (Now there are 1000CT, we will replenish to 1100CT)

unlabeled data Part III (1200MRI)

If you find this dataset useful for your research, please cite:

@inproceedings{NEURIPS2022_ee604e1b,
  author = {Ji, Yuanfeng and Bai, Haotian and GE, Chongjian and Yang, Jie and Zhu, Ye and Zhang, Ruimao and Li, Zhen and Zhanng, Lingyan and Ma, Wanling and Wan, Xiang and Luo, Ping},
  booktitle = {Advances in Neural Information Processing Systems},
  editor = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
  pages = {36722--36732},
  publisher = {Curran Associates, Inc.},
  title = {AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation},
  url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/ee604e1bedbd069d9fc9328b7b9584be-Paper-Datasets_and_Benchmarks.pdf},
  volume = {35},
  year = {2022}
}
Bean Plant Pathologies Dataset for Deep Learning Tasks
dataverse.harvard.edu
dataone.org
Updated May 3, 2024
Cite
Marconi Lab (2024). Bean Plant Pathologies Dataset for Deep Learning Tasks [Dataset]. http://doi.org/10.7910/DVN/WFSLBY
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.7910/DVN/WFSLBY
Dataset updated
May 3, 2024
Dataset provided by
Harvard Dataverse
Authors
Marconi Lab
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
This dataset is part of the Makerere University Beans Image Dataset, designed to diagnose bean crop diseases and conduct spatial analysis, which is available at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TCKVEW. It includes images categorized into four classes: healthy bean leaves, Angular Leaf Spot (ALS) in bean leaves, Bean Rust in bean leaves, and an additional 'unknown' class. These images help models distinguish the target bean leaf classes from other visuals. The dataset was used for a project that leverages edge computing and deep learning for the real-time identification of bean plant pathologies. The dataset is organized into two main folders, each serving a specific purpose. The first folder contains data for the classification task, with images distributed among the four classes: healthy, ALS, Bean Rust, and the unknown class. The second folder is dedicated to the detection task, featuring annotations for ALS and Bean Rust, as well as unlabeled healthy images to enhance the learning of detection models.
UVP5 data sorted with EcoTaxa and MorphoCluster
seanoe.org
image/*
Updated 2020
+ more versions
Cite
Rainer Kiko; Simon-Martin Schröder (2020). UVP5 data sorted with EcoTaxa and MorphoCluster [Dataset]. http://doi.org/10.17882/73002
Explore at:
image/* (available download formats)
Unique identifier
https://doi.org/10.17882/73002
Dataset updated
2020
Dataset provided by
SEANOE
Authors
Rainer Kiko; Simon-Martin Schröder
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Time period covered
Oct 23, 2012 - Aug 7, 2017
Description
Here, we provide plankton image data that was sorted with the web applications EcoTaxa and MorphoCluster. The data set was used for image classification tasks as described in Schröder et al. (in preparation) and does not include any geospatial or temporal meta-data.
Plankton was imaged using the Underwater Vision Profiler 5 (Picheral et al. 2010) in various regions of the world's oceans between 2012-10-24 and 2017-08-08.
This data publication consists of an archive containing "training.csv" (list of 392k training images for classification, validated using EcoTaxa), "validation.csv" (list of 196k validation images for classification, validated using EcoTaxa), "unlabeld.csv" (list of 1M unlabeled images), "morphocluster.csv" (1.2M objects validated using MorphoCluster, a subset of "unlabeled.csv" and "validation.csv") and the image files themselves. The CSV files each contain the columns "object_id" (a unique id), "image_fn" (the relative filename), and "label" (the assigned name).
The training and validation sets were sorted into 65 classes using the web application EcoTaxa (http://ecotaxa.obs-vlfr.fr). This data shows a severe class imbalance; the 10% most populated classes contain more than 80% of the objects and the class sizes span four orders of magnitude. The validation set and a set of additional 1M unlabeled images were sorted during the first trial of MorphoCluster (https://github.com/morphocluster).
The images in this data set were sampled during RV Meteor cruises M92, M93, M96, M97, M98, M105, M106, M107, M108, M116, M119, M121, M130, M131, M135, M136, M137 and M138, during RV Maria S. Merian cruises MSM22, MSM23, MSM40 and MSM49, during the RV Polarstern cruise PS88b and during the FLUXES1 experiment with RV Sarmiento de Gamboa.
The following people have contributed to the sorting of the image data on EcoTaxa: Rainer Kiko, Tristan Biard, Benjamin Blanc, Svenja Christiansen, Justine Courboules, Charlotte Eich, Jannik Faustmann, Christine Gawinski, Augustin Lafond, Aakash Panchal, Marc Picheral, Akanksha Singh and Helena Hauss.
In Schröder et al. (in preparation), the training set serves as a source for knowledge transfer in the training of the feature extractor. The classification using MorphoCluster was conducted by Rainer Kiko. Used labels are operational and not yet matched to respective EcoTaxa classes.
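The class imbalance described above can be checked directly from the "label" column. The following is a minimal sketch, not code from the data publication: it assumes the CSV files are comma-separated with a header row, and the helper names (`class_counts`, `top_share`) and the in-memory sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def class_counts(csv_file: TextIO) -> Counter:
    """Count images per class from a CSV that has a "label" column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"] for row in reader)


def top_share(counts: Counter, top_fraction: float = 0.10) -> float:
    """Fraction of all objects that fall into the most populated classes."""
    sizes = sorted(counts.values(), reverse=True)
    k = max(1, int(len(sizes) * top_fraction))  # always keep at least one class
    return sum(sizes[:k]) / sum(sizes)


# Hypothetical in-memory sample using the documented column names;
# a real file would be opened with open(path, newline="") instead.
rows = ["object_id,image_fn,label"]
rows += [f"{i},img/{i}.jpg,copepod" for i in range(8)]
rows += ["8,img/8.jpg,detritus", "9,img/9.jpg,other"]
sample = io.StringIO("\n".join(rows))

counts = class_counts(sample)
share = top_share(counts)
print(counts.most_common(), share)
```

Run against the real "training.csv", `top_share` would give the share of objects held by the top 10% of classes, which the description reports as more than 80%.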
Data from: SemEval-2021 Task 10: Source-Free Domain Adaptation for Semantic...
zenodo.org
data.niaid.nih.gov
bin, zip
Updated Jul 28, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Egoitz Laparra; Xin Su; Yiyun Zhao; Özlem Uzuner; Timothy A. Miller; Steven Bethard (2021). SemEval-2021 Task 10: Source-Free Domain Adaptation for Semantic Processing [Dataset]. http://doi.org/10.5281/zenodo.5132956
Explore at:
zip, bin (available download formats)
Unique identifier
https://doi.org/10.5281/zenodo.5132956
Dataset updated
Jul 28, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Egoitz Laparra; Xin Su; Yiyun Zhao; Özlem Uzuner; Timothy A. Miller; Steven Bethard
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Data sharing restrictions are common in NLP datasets. For example, Twitter policies do not allow sharing of tweet text, though tweet IDs may be shared. The situation is even more common in clinical NLP, where patient health information must be protected, and annotations over health text, when released at all, often require the signing of complex data use agreements. The SemEval-2021 Task 10 framework asks participants to develop semantic annotation systems in the face of data sharing constraints. A participant's goal is to develop an accurate system for a target domain when annotations exist for a related domain but cannot be distributed. Instead of annotated training data, participants are given a model trained on the annotations. Then, given unlabeled target domain data, they are asked to make predictions.

Website: https://machine-learning-for-medical-language.github.io/source-free-domain-adaptation/

CodaLab site: https://competitions.codalab.org/competitions/26152

Github repository: https://github.com/Machine-Learning-for-Medical-Language/source-free-domain-adaptation
DataSheet_1_HiRAND: A novel GCN semi-supervised deep learning-based...
frontiersin.figshare.com
pdf
Updated Jun 21, 2023
+ more versions
Cite
Yue Huang; Zhiwei Rong; Liuchao Zhang; Zhenyi Xu; Jianxin Ji; Jia He; Weisha Liu; Yan Hou; Kang Li (2023). DataSheet_1_HiRAND: A novel GCN semi-supervised deep learning-based framework for classification and feature selection in drug research and development.pdf [Dataset]. http://doi.org/10.3389/fonc.2023.1047556.s001
Explore at:
pdf (available download formats)
Unique identifier
https://doi.org/10.3389/fonc.2023.1047556.s001
Dataset updated
Jun 21, 2023
Dataset provided by
Frontiers
Authors
Yue Huang; Zhiwei Rong; Liuchao Zhang; Zhenyi Xu; Jianxin Ji; Jia He; Weisha Liu; Yan Hou; Kang Li
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The prediction of response to drugs before initiating therapy based on transcriptome data is a major challenge. However, identifying effective drug response label data costs time and resources. Methods available often predict poorly and fail to identify robust biomarkers due to the curse of dimensionality: high dimensionality and low sample size. Therefore, this necessitates the development of predictive models to effectively predict the response to drugs using limited labeled data while being interpretable. In this study, we report a novel Hierarchical Graph Random Neural Networks (HiRAND) framework to predict the drug response using transcriptome data of few labeled data and additional unlabeled data. HiRAND completes the information integration of the gene graph and sample graph by graph convolutional network (GCN). The innovation of our model is leveraging data augmentation strategy to solve the dilemma of limited labeled data and using consistency regularization to optimize the prediction consistency of unlabeled data across different data augmentations. The results showed that HiRAND achieved better performance than competitive methods in various prediction scenarios, including both simulation data and multiple drug response data. We found that the prediction ability of HiRAND in the drug vorinostat showed the best results across all 62 drugs. In addition, HiRAND was interpreted to identify the key genes most important to vorinostat response, highlighting critical roles for ribosomal protein-related genes in the response to histone deacetylase inhibition. Our HiRAND could be utilized as an efficient framework for improving the drug response prediction performance using few labeled data.
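Consistency regularization, as mentioned above, penalizes disagreement between predictions made for the same unlabeled sample under different augmentations. The sketch below shows one common form of such a penalty (mean squared distance of each prediction from the average prediction). It is an illustrative assumption, not the HiRAND implementation; the function name `consistency_loss` is hypothetical.

```python
from typing import List, Sequence


def consistency_loss(predictions: Sequence[Sequence[float]]) -> float:
    """Mean squared distance between each augmented prediction and their average.

    `predictions` holds one class-probability vector per augmentation,
    all for the same unlabeled sample.
    """
    n_aug = len(predictions)
    n_classes = len(predictions[0])
    # Average prediction across augmentations, per class.
    mean: List[float] = [
        sum(p[c] for p in predictions) / n_aug for c in range(n_classes)
    ]
    # Squared distance of every augmented prediction from that average.
    total = sum(
        sum((p[c] - mean[c]) ** 2 for c in range(n_classes)) for p in predictions
    )
    return total / n_aug


agreeing = consistency_loss([[0.9, 0.1], [0.9, 0.1]])     # identical predictions
disagreeing = consistency_loss([[1.0, 0.0], [0.0, 1.0]])  # opposite predictions
print(agreeing, disagreeing)
```

The loss is zero when all augmented views agree and grows as they diverge, so minimizing it on unlabeled samples pushes the model toward augmentation-stable predictions.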
Data from: Learning protein fitness models from evolutionary and...
zenodo.org
datadryad.org
application/gzip
Updated Jun 5, 2022
Cite
Chloe Hsu; Hunter Nisonoff; Clara Fannjiang; Jennifer Listgarten (2022). Learning protein fitness models from evolutionary and assay-labeled data [Dataset]. http://doi.org/10.6078/d1k71b
Explore at:
application/gzip (available download formats)
Unique identifier
https://doi.org/10.6078/d1k71b
Dataset updated
Jun 5, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Chloe Hsu; Hunter Nisonoff; Clara Fannjiang; Jennifer Listgarten
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily-related sequences, or variant sequences with experimentally measured labels. For regimes where only limited experimental data are available, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with one density feature from modelling the evolutionary data. Within this approach, we find that a variational autoencoder-based density model showed the best overall performance, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluations and sufficient baselines.
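The combination described above (ridge regression on site-specific amino acid features plus one density feature) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the one-hot encoding, the single shared penalty `alpha`, the absence of an intercept, and all function names are assumptions, and the density score is treated as a number supplied by some evolutionary density model.

```python
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def featurize(seq: str, density_score: float) -> List[float]:
    """One-hot encode every site, then append the single density feature."""
    features: List[float] = []
    for residue in seq:
        features.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    features.append(density_score)
    return features


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ridge_fit(X: Sequence[Sequence[float]], y: Sequence[float],
              alpha: float = 1.0) -> List[float]:
    """Closed-form ridge weights: (X^T X + alpha I)^-1 X^T y (alpha > 0)."""
    d = len(X[0])
    gram = [[sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    return solve(gram, rhs)


def predict(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


# Toy data: two-residue variants with made-up density scores and fitness labels.
X = [featurize("AC", 0.5), featurize("AD", -0.2), featurize("GC", 0.1)]
y = [1.0, 0.2, 0.6]
w = ridge_fit(X, y, alpha=1.0)
print(predict(w, featurize("GD", -0.4)))
```

Because the density score enters as just one extra column, any evolutionary density model can be swapped in without changing the regression step.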
Data from: The COUGHVID crowdsourcing dataset: A corpus for the study of...
zenodo.org
ekoizpen-zientifikoa.ehu.eus
+1more
zip
Updated Aug 26, 2022
+ more versions
Cite
Lara Orlandic; Tomas Teijeiro; David Atienza (2022). The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms [Dataset]. http://doi.org/10.5281/zenodo.4498364
Explore at:
zip (available download formats)
Unique identifier
https://doi.org/10.5281/zenodo.4498364
Dataset updated
Aug 26, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Lara Orlandic; Tomas Teijeiro; David Atienza
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Overview

Cough audio signal classification has been successfully used to diagnose a variety of respiratory conditions, and there has been significant interest in leveraging Machine Learning (ML) to provide widespread COVID-19 screening. The COUGHVID dataset provides over 20,000 crowdsourced cough recordings representing a wide range of subject ages, genders, geographic locations, and COVID-19 statuses. Furthermore, experienced pulmonologists labeled more than 2,000 recordings to diagnose medical abnormalities present in the coughs, thereby contributing one of the largest expert-labeled cough datasets in existence that can be used for a plethora of cough audio classification tasks. As a result, the COUGHVID dataset contributes a wealth of cough recordings for training ML models to address the world’s most urgent health crises.

Private Set and Testing Protocol

Researchers interested in testing their models on the private test dataset should contact us at coughvid@epfl.ch, briefly explaining the type of validation they wish to make and the results they obtained through cross-validation with the public data. Then, access to the unlabeled recordings will be provided, and the researchers should send the predictions of their models on these recordings. Finally, the performance metrics of the predictions will be sent to the researchers. The private testing data is not included in any file within our Zenodo record, and it can only be accessed by contacting the COUGHVID team at the aforementioned e-mail address.
Table_4_sscNOVA: a semi-supervised convolutional neural network for...
figshare.com
xlsx
Updated Feb 6, 2024
Cite
Haibo Li; Zhenhua Yu; Fang Du; Lijuan Song; Yang Gao; Fangyuan Shi (2024). Table_4_sscNOVA: a semi-supervised convolutional neural network for predicting functional regulatory variants in autoimmune diseases.xlsx [Dataset]. http://doi.org/10.3389/fimmu.2024.1323072.s005
Explore at:
xlsx (available download formats)
Unique identifier
https://doi.org/10.3389/fimmu.2024.1323072.s005
Dataset updated
Feb 6, 2024
Dataset provided by
Frontiers
Authors
Haibo Li; Zhenhua Yu; Fang Du; Lijuan Song; Yang Gao; Fangyuan Shi
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Genome-wide association studies (GWAS) have identified thousands of variants in the human genome associated with autoimmune diseases. However, identifying functional regulatory variants associated with autoimmune diseases remains challenging, largely because of insufficient experimental validation data. We adopt the concept of semi-supervised learning by combining labeled and unlabeled data to develop a deep learning-based algorithm framework, sscNOVA, to predict functional regulatory variants in autoimmune diseases and analyze the functional characteristics of these regulatory variants. Compared to traditional supervised learning methods, our approach leverages more variants’ data to explore the relationship between functional regulatory variants and autoimmune diseases. Based on the experimentally curated testing dataset and evaluation metrics, we find that sscNOVA outperforms other state-of-the-art methods. Furthermore, we illustrate that sscNOVA can help to improve the prioritization of functional regulatory variants from lead single-nucleotide polymorphisms and the proxy variants in autoimmune GWAS data.
Unlabelled Weed Detection Images for Hot Peppers
data.amerigeoss.org
Updated Nov 1, 2022
Cite
Trinidad and Tobago (2022). Unlabelled Weed Detection Images for Hot Peppers [Dataset]. https://data.amerigeoss.org/dataset/weeddetection_hotpeppers
Explore at:
Dataset updated
Nov 1, 2022
Dataset provided by
Trinidad and Tobago
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This data contains images of Capsicum annuum grown on several smallholder farms in Trinidad and Tobago, showing different levels of weed cover and different weed species. In most instances, weeds can be recognized by the naked eye. However, there are times when the weeds and the crops are of similar species and may appear almost identical. When weeds are plentiful and interwoven with crops, it becomes increasingly difficult to determine weed cover on a given piece of land. This data can be used in research surrounding weed detection in hot peppers. When accompanied by the labelled versions, this data can be used to train machine learning models for weed detection in Capsicum annuum (hot peppers).

Cite

Juan Sebastián Cañas; Toro-Gómez María Paula; Moreira Sugai Larissa Sayuri; Luis Felipe Toledo; De Souza Franco Leandro; Neckel De Oliveira Selvino; Pereira Bastos Rogerio; Llusia Diego; Ulloa Juan Sebastián (2024). Unlabeled AnuraSet: A dataset for leveraging unlabeled data in machine learning models for passive acoustic monitoring [Dataset]. http://doi.org/10.5281/zenodo.11244814

Unlabeled AnuraSet: A dataset for leveraging unlabeled data in machine learning models for passive acoustic monitoring

Explore at:

zip, txt (available download formats)

Unique identifier

https://doi.org/10.5281/zenodo.11244814

Dataset updated

May 27, 2024

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically

Description

The Unlabeled AnuraSet (U-AnuraSet) is an extension of the original AnuraSet dataset. It consists of soundscape recordings from passive acoustic monitoring conducted in Brazil. The recording sites are identical to those in the original AnuraSet. Each site comprises 2,666 one-minute raw audio files of unlabeled data. The U-AnuraSet is publicly available to encourage machine learning researchers to explore innovative methods for leveraging unlabeled data in the training of models aimed at solving problems such as anuran call identification.

If you find the Unlabeled AnuraSet useful for your research, please consider citing it as follows:

Cañas, J.S., Toro-Gómez, M.P., Sugai, L.S.M., et al. A dataset for benchmarking Neotropical anuran calls identification in passive acoustic monitoring. Sci Data 10, 771 (2023). https://doi.org/10.1038/s41597-023-02666-2
