Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The receiver operating characteristic (ROC) curve is a widely used tool in various fields, including economics, medicine, and machine learning, for evaluating classification performance and comparing treatment effects. The absence of clear and readily available labels is a frequent phenomenon in ROC estimation, owing to reasons such as labeling cost, time constraints, data privacy, and information asymmetry. Traditional supervised estimators commonly rely solely on labeled data, where each sample is associated with a fully observed response variable. We propose a new set of semi-supervised (SS) estimators that exploit available unlabeled data (samples lacking observations of the response) to enhance estimation precision under the semi-parametric setting, which assumes that the distribution of the response variable for one group is known up to unknown parameters. The newly proposed SS estimators have attractive properties such as adaptability and efficiency by leveraging the flexibility of the kernel smoothing method. We establish the large-sample properties of the SS estimators, which demonstrate that they consistently outperform the supervised estimator under mild assumptions. Numerical experiments provide empirical evidence to support our theoretical findings. Finally, we showcase the practical applicability of our proposed methodology by applying it to two real datasets.
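As a rough illustration of the setting (not the authors' SS estimator), the sketch below computes a kernel-smoothed ROC curve from labeled scores only, i.e., the supervised baseline that semi-supervised estimators of this kind aim to improve upon; the Gaussian kernel and Silverman bandwidths are assumptions.

```python
import numpy as np
from scipy.stats import norm

def smoothed_cdf(x, sample, bandwidth):
    """Gaussian-kernel-smoothed empirical CDF evaluated at points x."""
    return norm.cdf((x - sample[:, None]) / bandwidth).mean(axis=0)

def smoothed_roc(neg_scores, pos_scores, grid_size=200):
    """ROC as (FPR(t), TPR(t)) over a threshold grid, using smoothed CDFs."""
    neg, pos = np.asarray(neg_scores, float), np.asarray(pos_scores, float)
    h_neg = 1.06 * neg.std() * len(neg) ** (-1 / 5)  # Silverman's rule of thumb
    h_pos = 1.06 * pos.std() * len(pos) ** (-1 / 5)
    lo = min(neg.min(), pos.min()) - 3 * max(h_neg, h_pos)
    hi = max(neg.max(), pos.max()) + 3 * max(h_neg, h_pos)
    t = np.linspace(lo, hi, grid_size)
    fpr = 1 - smoothed_cdf(t, neg, h_neg)  # false positive rate
    tpr = 1 - smoothed_cdf(t, pos, h_pos)  # true positive rate
    return fpr, tpr

rng = np.random.default_rng(0)
fpr, tpr = smoothed_roc(rng.normal(0, 1, 200), rng.normal(1, 1, 50))  # synthetic demo
```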
Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most of the deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate these limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and a test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of the existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. The paper can be found at https://arxiv.org/pdf/2206.08023.pdf. In addition to the 600 labeled CT and MRI scans, we expect to provide 2000 CT and 1200 MRI scans without labels to support more learning tasks (semi-supervised, unsupervised, domain adaptation, ...). The links are:
labeled data (500 CT + 100 MRI)
unlabeled data Part I (900 CT)
unlabeled data Part II (1100 CT) (currently 1000 CT; we will replenish to 1100 CT)
unlabeled data Part III (1200 MRI)
If you find this dataset useful for your research, please cite:
@article{ji2022amos,
  title={AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation},
  author={Ji, Yuanfeng and Bai, Haotian and Yang, Jie and Ge, Chongjian and Zhu, Ye and Zhang, Ruimao and Li, Zhen and Zhang, Lingyan and Ma, Wanling and Wan, Xiang and others},
  journal={arXiv preprint arXiv:2206.08023},
  year={2022}
}
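A minimal sketch for inspecting one labeled AMOS case, assuming the scans are distributed as NIfTI volumes; the file names and folder layout below are placeholders, not the dataset's guaranteed structure.

```python
import nibabel as nib
import numpy as np

image = nib.load("imagesTr/amos_0001.nii.gz")   # hypothetical file name
label = nib.load("labelsTr/amos_0001.nii.gz")   # hypothetical file name

ct = image.get_fdata()                           # voxel intensities
seg = label.get_fdata().astype(np.int16)         # 0 = background, 1..15 = organ labels

print("volume shape:", ct.shape, "voxel spacing:", image.header.get_zooms())
print("labels present:", np.unique(seg))
```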
https://academictorrents.com/nolicensespecified
The STL-10 dataset is an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. It is inspired by the CIFAR-10 dataset but with some modifications. In particular, each class has fewer labeled training examples than in CIFAR-10, but a very large set of unlabeled examples is provided to learn image models prior to supervised training. The primary challenge is to make use of the unlabeled data (which comes from a similar but different distribution than the labeled data) to build a useful prior. We also expect that the higher resolution of this dataset (96x96) will make it a challenging benchmark for developing more scalable unsupervised learning methods.
Overview: 10 classes: airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck. Images are 96x96 pixels, color. 500 training images (10 pre-defined folds) and 800 test images per class. 100,000 unlabeled images for unsupervised learning.
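One convenient way to read the labeled and unlabeled splits is the torchvision wrapper; a minimal loading sketch (the unlabeled split returns the placeholder label -1):

```python
import torchvision
import torchvision.transforms as T

transform = T.ToTensor()
unlabeled = torchvision.datasets.STL10(root="./stl10", split="unlabeled",
                                       download=True, transform=transform)
train = torchvision.datasets.STL10(root="./stl10", split="train",
                                   download=True, transform=transform)

image, label = unlabeled[0]
print(image.shape, label)   # torch.Size([3, 96, 96]) -1
```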
Identifying important policy outputs has long been of interest to political scientists. In this work, we propose a novel approach to the classification of policies. Instead of obtaining and aggregating expert evaluations of significance for a finite set of policy outputs, we use experts to identify a small set of significant outputs and then employ positive unlabeled (PU) learning to search for other similar examples in a large unlabeled set. We further propose to automate the first step by harvesting ‘seed’ sets of significant outputs from web data. We offer an application of the new approach by classifying over 9,000 government regulations in the United Kingdom. The obtained estimates are successfully validated against human experts, by forecasting web citations, and with a construct validity test.
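A hedged sketch of the positive-unlabeled scoring step in the spirit of Elkan and Noto, not necessarily the authors' exact pipeline; X_pos stands for features of the expert-identified significant outputs and X_unl for the unlabeled remainder, both synthetic placeholders here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def pu_scores(X_pos, X_unl):
    """Score unlabeled items by an estimate of p(significant | x)."""
    X_pos_tr, X_pos_hold = train_test_split(X_pos, test_size=0.2, random_state=0)
    X = np.vstack([X_pos_tr, X_unl])
    s = np.r_[np.ones(len(X_pos_tr)), np.zeros(len(X_unl))]   # s = "is labeled positive"
    clf = LogisticRegression(max_iter=1000).fit(X, s)
    c = clf.predict_proba(X_pos_hold)[:, 1].mean()            # label frequency p(s=1 | y=1)
    return clf.predict_proba(X_unl)[:, 1] / c                 # Elkan-Noto correction

rng = np.random.default_rng(0)
scores = pu_scores(rng.normal(1, 1, (50, 5)), rng.normal(0, 1, (500, 5)))  # synthetic demo
```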
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In the field of polymer informatics, utilizing machine learning (ML) techniques to evaluate the glass transition temperature Tg and other properties of polymers has attracted extensive attention. This data-centric approach is much more efficient and practical than laborious experimental measurements when encountering a daunting number of polymer structures. Various ML models have been demonstrated to perform well for Tg prediction. Nevertheless, they are trained on different data sets, using different structure representations, and based on different feature engineering methods. Thus, the critical question arises of how to select a proper ML model to better handle Tg prediction with generalization ability. To provide a fair comparison of different ML techniques and examine the key factors that affect model performance, we carry out a systematic benchmark study by compiling 79 different ML models and training them on a large and diverse data set. The three major components in setting up an ML model are structure representations, feature representations, and ML algorithms. In terms of polymer structure representation, we consider the polymer monomer, repeat unit, and oligomer with longer chain structure. Based on that, the feature representation is calculated, including Morgan fingerprints with or without substructure frequency, RDKit descriptors, molecular embeddings, molecular graphs, etc. Afterward, the obtained feature input is fed to different ML algorithms, such as deep neural networks, convolutional neural networks, random forest, support vector machine, LASSO regression, and Gaussian process regression. We evaluate the performance of these ML models using a holdout test set and an extra unlabeled data set from high-throughput molecular dynamics simulation. We focus especially on the ML models' generalization ability on the unlabeled data set, and the models' sensitivity to the topology and molecular weight of polymers is also taken into consideration. This benchmark study provides not only a guideline for the Tg prediction task but also a useful reference for other polymer informatics tasks.
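As an illustration of one such combination (Morgan fingerprints of a repeat unit fed to a random forest), here is a minimal sketch; the SMILES strings and Tg values are placeholders, not entries from the benchmark data set.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

def morgan_features(smiles_list, radius=2, n_bits=2048):
    """Encode each SMILES string as a Morgan fingerprint bit vector."""
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.float64)
        DataStructs.ConvertToNumpyArray(fp, arr)
        rows.append(arr)
    return np.vstack(rows)

smiles = ["CC(C)C(=O)OC", "c1ccccc1C=C", "CC=C"]   # placeholder repeat-unit SMILES
tg = np.array([378.0, 373.0, 250.0])               # placeholder Tg values (K)

model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(morgan_features(smiles), tg)
```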
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Instance generation creates representative examples to interpret a learning model, as in regression and classification. For example, representative sentences of a topic of interest describe the topic specifically for sentence categorization. In such a situation, a large number of unlabeled observations may be available in addition to labeled data; for example, many unclassified text corpora (unlabeled instances) are available with only a few classified sentences (labeled instances). In this article, we introduce a novel generative method, called a coupled generator, producing instances given a specific learning outcome, based on indirect and direct generators. The indirect generator uses the inverse principle to yield the corresponding inverse probability, enabling instance generation that leverages unlabeled data. The direct generator learns the distribution of an instance given its learning outcome. Then, the coupled generator seeks the best one from the indirect and direct generators, which is designed to enjoy the benefits of both and deliver higher generation accuracy. For sentence generation given a topic, we develop an embedding-based regression/classification in conjunction with an unconditional recurrent neural network for the indirect generator, whereas a conditional recurrent neural network is natural for the corresponding direct generator. Moreover, we derive finite-sample generation error bounds for the indirect and direct generators to reveal the generative aspects of both methods, thus explaining the benefits of the coupled generator. Finally, we apply the proposed methods to a real benchmark of abstract classification and demonstrate that the coupled generator composes reasonably good sentences from a dictionary to describe a specific topic of interest.
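A schematic sketch of the direct-generator component only (a label-conditioned recurrent decoder that models an instance given its learning outcome); the layer sizes and the use of the topic label to initialize the hidden state are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class ConditionalRNNGenerator(nn.Module):
    def __init__(self, vocab_size, n_topics, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, emb_dim)
        self.topic_emb = nn.Embedding(n_topics, hidden_dim)   # topic initializes the state
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, topic):
        h0 = self.topic_emb(topic).unsqueeze(0)               # (1, batch, hidden)
        output, _ = self.gru(self.token_emb(tokens), h0)
        return self.out(output)                               # next-token logits

model = ConditionalRNNGenerator(vocab_size=5000, n_topics=10)
logits = model(torch.randint(0, 5000, (4, 12)), torch.tensor([0, 1, 2, 3]))
```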
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reaching the performance of fully supervised learning with unlabeled data and only one labeled sample per class would be ideal for deep learning applications. We demonstrate for the first time the potential for building one-shot semi-supervised (BOSS) learning on CIFAR-10 and SVHN that attains test accuracies comparable to fully supervised learning. Our method combines class prototype refining, class balancing, and self-training. A good prototype choice is essential, and we propose a technique for obtaining iconic examples. In addition, we demonstrate that class balancing methods substantially improve accuracy in semi-supervised learning, to levels that allow self-training to reach fully supervised performance. Our experiments demonstrate the value of computing and analyzing test accuracies for every class, rather than only a total test accuracy. We show that our BOSS methodology can obtain total test accuracies of up to 95% on CIFAR-10 with only one labeled sample per class (compared to 94.5% for fully supervised). Similarly, on SVHN it obtains a test accuracy of 97.8%, compared to 98.27% for fully supervised. Rigorous empirical evaluations provide evidence that labeling large datasets is not necessary for training deep neural networks. Our code is available at https://github.com/lnsmith54/BOSS to facilitate replication.
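The self-training component can be illustrated with scikit-learn's generic wrapper; this toy sketch is not the BOSS pipeline (which additionally uses prototype refining and class balancing) and runs on synthetic data with one labeled sample per class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=2000, n_informative=8, n_classes=4, random_state=0)
labeled_idx = np.array([np.where(y == c)[0][0] for c in range(4)])  # one sample per class
y_partial = np.full_like(y, -1)                                     # -1 marks unlabeled
y_partial[labeled_idx] = y[labeled_idx]

clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
clf.fit(X, y_partial)
print("pseudo-labeled samples:", int((clf.transduction_ != -1).sum()))
```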
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This includes the FIJI macro and the image stack dataset used in the manuscript "Optical sectioning of unlabeled samples using bright-field microscopy".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
Usage
You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
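If you prefer Python over the provided Julia tooling, an extracted CSV file and an index file can be combined roughly as follows; the file name "dataset.csv" and the absence of a header row in the index files are assumptions made for this sketch.

```python
import pandas as pd

data = pd.read_csv("dataset.csv")                           # one of the extracted CSV files
indices = pd.read_csv("app_val_indices.csv", header=None)   # one evaluation sample per row

sample = data.iloc[indices.iloc[0].dropna().astype(int)]    # re-draw the first APP sample
prevalences = sample["class_label"].value_counts(normalize=True).sort_index()
print(prevalences)                                          # label distribution of the sample
```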
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
A dataset for unsupervised person re-identification using Generative Adversarial Networks (GANs).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This item is part of the collection "AIS Trajectories from Danish Waters for Abnormal Behavior Detection"
DOI: https://doi.org/10.11583/DTU.c.6287841
Using deep learning for the detection of abnormal maritime behaviour in spatio-temporal trajectories is a relatively new and promising application. Open access to the Automatic Identification System (AIS) has made large amounts of maritime trajectories publicly available. However, these trajectories are unannotated when it comes to the detection of abnormal behaviour.
The lack of annotated datasets for abnormality detection on maritime trajectories makes it difficult to evaluate and compare suggested models quantitatively. With this dataset, we attempt to provide a way for researchers to evaluate and compare performance.
We have manually labelled trajectories which showcase abnormal behaviour following a collision accident. The annotated dataset consists of 521 data points with 25 abnormal trajectories. The abnormal trajectories cover, among others: colliding vessels, vessels engaged in Search-and-Rescue activities, law enforcement, and commercial maritime traffic forced to deviate from its normal course.
These datasets consist of unlabelled trajectories for the purpose of training unsupervised models. For labelled evaluation datasets, please refer to the collection; the link is in Related publications.
The data are saved using the pickle format for Python. Each dataset is split into 2 files with the naming convention:
datasetInfo_XXX
data_XXX
Files named "data_XXX" contains the extracted trajectories serialized sequentially one at a time and must be read as such. Please refer to provided utility functions for examples. Files named "datasetInfo" contains Metadata related to the dataset and indecies at which trajectories begin in "data_XXX" files.
The data are sequences of maritime trajectories defined by their timestamp, latitude/longitude position, speed, course, and unique ship identifier (MMSI). In addition, the dataset contains metadata related to the creation parameters. The dataset has been limited to a specific time period, ship types, and moving AIS navigational statuses, and filtered within a region of interest (ROI). Trajectories were split if they exceeded an upper length limit, and short trajectories were discarded. All values are given as metadata in the dataset and used in the naming syntax.
Naming syntax: data_AIS_Custom_STARTDATE_ENDDATE_SHIPTYPES_MINLENGTH_MAXLENGTH_RESAMPLEPERIOD.pkl
See the datasheet for more detailed information, and refer to the provided utility functions for examples of how to read and plot the data.
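A minimal reading sketch based on the description above (one trajectory per sequential pickle record); the file name is a placeholder, and the exact structure of each record is documented in the provided utilities rather than assumed here.

```python
import pickle

def read_trajectories(path):
    """Yield trajectories one at a time from a data_XXX file."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)      # one serialized trajectory per load
            except EOFError:
                return

for i, trajectory in enumerate(read_trajectories("data_AIS_Custom_example.pkl")):
    if i == 0:
        print(type(trajectory))           # inspect the first record
```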
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The prediction of response to drugs before initiating therapy based on transcriptome data is a major challenge. However, obtaining drug response labels costs time and resources. Available methods often predict poorly and fail to identify robust biomarkers due to the curse of dimensionality: high dimensionality and low sample size. This necessitates the development of models that effectively predict drug response from limited labeled data while remaining interpretable. In this study, we report a novel Hierarchical Graph Random Neural Networks (HiRAND) framework to predict drug response using transcriptome data from a few labeled samples and additional unlabeled samples. HiRAND integrates information from the gene graph and the sample graph via graph convolutional networks (GCNs). The innovation of our model is leveraging a data augmentation strategy to address the dilemma of limited labeled data and using consistency regularization to optimize prediction consistency for unlabeled data across different augmentations. The results showed that HiRAND achieved better performance than competing methods in various prediction scenarios, including both simulation data and multiple drug response datasets. We found that HiRAND's prediction performance was best for the drug vorinostat among all 62 drugs. In addition, HiRAND was interpreted to identify the key genes most important to vorinostat response, highlighting critical roles for ribosomal protein-related genes in the response to histone deacetylase inhibition. Our HiRAND could be utilized as an efficient framework for improving drug response prediction performance using few labeled samples.
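A schematic sketch of the consistency-regularization idea described above (not the full HiRAND architecture): predictions on two random augmentations of the same unlabeled samples are pushed toward each other; the linear model and dropout-style augmentation are placeholders.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x_unlabeled, augment):
    """Mean squared difference between predictions under two random augmentations."""
    p1 = F.softmax(model(augment(x_unlabeled)), dim=-1)
    p2 = F.softmax(model(augment(x_unlabeled)), dim=-1)
    return F.mse_loss(p1, p2)

model = torch.nn.Linear(100, 2)            # placeholder for the GCN-based predictor
augment = torch.nn.Dropout(p=0.2)          # placeholder feature-dropout augmentation
x_unlabeled = torch.randn(32, 100)
loss = consistency_loss(model, x_unlabeled, augment)
loss.backward()
```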
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
unofficial mirror of VietMed (Vietnamese speech data in medical domain) unlabeled set
official announcement: https://arxiv.org/abs/2404.05659 official download: https://huggingface.co/datasets/leduckhai/VietMed this repo contains the unlabeled set: 966h - 230k samples i also gather the metadata: see info.csv my extraction code: https://github.com/phineas-pta/fine-tune-whisper-vi/blob/main/misc/vietmed-unlabeled.py need to do: check misspelling, restore foreign words phonetised to… See the full description on the dataset page: https://huggingface.co/datasets/doof-ferb/VietMed_unlabeled.
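a quick loading sketch via the Hugging Face datasets library (streaming, so the full 966h need not be downloaded up front); the split name and record fields are assumptions here:

```python
from datasets import load_dataset

ds = load_dataset("doof-ferb/VietMed_unlabeled", split="train", streaming=True)
first = next(iter(ds))
print(first.keys())   # inspect available fields (audio, metadata, ...)
```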
The Fish Detection AI project aims to improve the efficiency of fish monitoring around marine energy facilities to comply with regulatory requirements. Despite advancements in computer vision, there is limited focus on sonar images, on identifying small fish with unlabeled data, and on methods for underwater fish monitoring for marine energy. A YOLO (You Only Look Once) computer vision model was developed using the Eyesea dataset (optical) and sonar images from Alaska Fish and Games to identify fish in underwater environments. Supervised methods were used within YOLO to detect fish based on training with labeled fish data. These trained models were then applied to different unseen datasets, aiming to reduce the need for labeling datasets and training new models for various locations. Additionally, hyper-image analysis and various image preprocessing methods were explored to enhance fish detection.
In this research we achieved enhanced YOLO performance compared to a published article (Xu, Matzner 2018) that used earlier YOLO versions for fish object identification. Specifically, we achieved a best mean Average Precision (mAP) of 0.68 on the Eyesea optical dataset using YOLO v8 (medium-sized model), surpassing the previous YOLO v3 benchmarks from that publication. We further demonstrated up to 0.65 mAP on unseen sonar domains by leveraging a hyper-image approach (stacking consecutive frames), showing promising cross-domain adaptability.
This submission of data includes:
- The best-performing trained YOLO model neural network weights, which can be applied to do object detection (PyTorch files, .pt). These are found in the Yolo_models_downloaded zip file.
- A documentation file explaining the upload and the goals of each of experiments 1-5, as detailed in the Word document (named "Yolo_Object_Detection_How_To_Document.docx").
- Coding files, namely 5 sub-folders of Python, shell, and YAML files that were used to run experiments 1-5, as well as a separate folder for YOLO models. Each of these is found in its own zip file, named after each experiment.
- Sample data structures (sample1 and sample2, each with their own zip file) to show how the raw data should be structured after running our provided code on the raw downloaded data.
- A link to the article that we were replicating (Xu, Matzner 2018).
- A link to the YOLO documentation site from the original creators of that model (Ultralytics).
- A link to the downloadable EyeSea dataset from PNNL (instructions on how to download and format the data to replicate these experiments are found in the How To Word document).
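A minimal inference sketch with the ultralytics package, assuming one of the provided .pt weight files has been extracted from the Yolo_models_downloaded archive; the weight and image file names below are placeholders.

```python
from ultralytics import YOLO

model = YOLO("yolov8m_eyesea_best.pt")            # placeholder weight-file name
results = model.predict("example_frame.png", conf=0.25)

for r in results:
    for box in r.boxes:
        print(int(box.cls), float(box.conf), box.xyxy.tolist())  # class, score, bbox
```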
Bluesky Journalist Classification Dataset
Dataset Description
This dataset contains Bluesky user profiles for training and evaluating journalist classification models. Created for the CSH Vienna Machine Learning Workshop, it includes comprehensive user data with human-verified labels for binary classification tasks.
Dataset Summary
Total Examples: 1,189
Test Split: 229 labeled examples
Unlabeled Split: 960 unlabeled examples
Languages: Primarily English… See the full description on the dataset page: https://huggingface.co/datasets/ruggsea/bluesky-journalist-classification.
This dataset was collected as part of the U.S. Department of Transportation (U.S. DOT) Intersection Safety Challenge (hereafter, “the Challenge”) for Stage 1B: System Assessment and Virtual Testing. Multi-sensor data were collected at a controlled test roadway intersection at the Federal Highway Administration (FHWA) Turner-Fairbank Highway Research Center (TFHRC) Smart Intersection facility in McLean, VA from October 2023 through March 2024. The data include potential conflict-based and non-conflict-based experimental scenarios between vulnerable road users (e.g., pedestrians, bicyclists) and vehicles during both daytime and nighttime conditions. Note that no actual human vulnerable road users were put at risk of being involved in a collision during the data collection efforts. The provided data (hereafter, “the Challenge Dataset”) are unlabeled training data (without ground truth) that were collected to be used for intersection safety system algorithm training, refinement, tuning, and/or validation, but may have additional uses. For a summary of the Stage 1B data collection effort, please see this video: https://youtu.be/csirVHFa2Cc. The Challenge Dataset includes data at a single, signalized four-way intersection from 20 roadside sensors and traffic control devices, including eight closed-circuit television (CCTV) visual cameras, five thermal cameras, two light detection and ranging (LiDAR) sensors, and four radar sensors. Intrinsic calibration was performed for all visual and thermal cameras. Extrinsic calibration was performed for specific pairs of roadside sensors. Additionally, the traffic signal phase and timing data and vehicle and/or pedestrian calls to the traffic signal controller (if any) are also provided. The total number of unique runs in the Challenge Dataset is 1,104, bringing the total size of the dataset to approximately 1 TB. A sample of 20 unique runs from the Challenge Dataset is provided here for download, inspection, and use. If, after inspecting this sample, a potential data user would like access to download the full Challenge Dataset, a request can be made via the form here: https://its.dot.gov/data/data-request. For more details about the data collection, supplemental files, organization and dictionary, and sensor calibration, see the attached “U.S. DOT ISC Stage 1B ITS DataHub Metadata_v1.0.pdf” document. For more information on the background of the Intersection Safety Challenge Stage 1B, please visit: https://www.its.dot.gov/research-areas/Intersection-Safety-Challenge/.
https://spdx.org/licenses/CC0-1.0.html
We use open source human gut microbiome data to learn a microbial “language” model by adapting techniques from Natural Language Processing (NLP). Our microbial “language” model is trained in a self-supervised fashion (i.e., without additional external labels) to capture the interactions among different microbial taxa and the common compositional patterns in microbial communities. The learned model produces contextualized taxon representations that allow a single microbial taxon to be represented differently according to the specific microbial environment in which it appears. The model further provides a sample representation by collectively interpreting different microbial taxa in the sample and their interactions as a whole. We demonstrate that, while our sample representation performs comparably to baseline models in in-domain prediction tasks such as predicting Irritable Bowel Disease (IBD) and diet patterns, it significantly outperforms them when generalizing to test data from independent studies, even in the presence of substantial distribution shifts. Through a variety of analyses, we further show that the pre-trained, context-sensitive embedding captures meaningful biological information, including taxonomic relationships, correlations with biological pathways, and relevance to IBD expression, despite the model never being explicitly exposed to such signals.
Methods: No additional raw data was collected for this project. All inputs are available publicly. American Gut Project, Halfvarson, and Schirmer raw data are available from the NCBI database (accession numbers PRJEB11419, PRJEB18471, and PRJNA398089, respectively). We used the curated data produced by Tataru and David, 2020.
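A schematic sketch of the modeling idea (taxa as “tokens”, a transformer encoder producing contextualized taxon embeddings and a pooled sample embedding); the vocabulary size, dimensions, and mean pooling are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class TaxaEncoder(nn.Module):
    def __init__(self, n_taxa, dim=64, heads=4, layers=2):
        super().__init__()
        self.emb = nn.Embedding(n_taxa, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, taxon_ids):
        contextual = self.encoder(self.emb(taxon_ids))    # context-sensitive taxon embeddings
        return contextual, contextual.mean(dim=1)         # plus a pooled sample embedding

model = TaxaEncoder(n_taxa=2000)
taxon_emb, sample_emb = model(torch.randint(0, 2000, (8, 50)))   # 8 samples, 50 taxa each
```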
https://dataintelo.com/privacy-and-policy
The global market size of Machine Learning (ML) courses is witnessing substantial growth, with market valuation expected to reach $3.1 billion in 2023 and projected to soar to $12.6 billion by 2032, exhibiting a robust CAGR of 16.5% over the forecast period. This rapid expansion is fueled by the increasing adoption of artificial intelligence (AI) and machine learning technologies across various industries, the rising need for upskilling and reskilling in the workforce, and the growing penetration of online education platforms.
One of the most significant growth factors driving the ML courses market is the escalating demand for AI and ML expertise in the job market. As industries increasingly integrate AI and machine learning into their operations to enhance efficiency and innovation, there is a burgeoning need for professionals with relevant skills. Companies across sectors such as finance, healthcare, retail, and manufacturing are investing heavily in training programs to bridge the skills gap, thus driving the demand for ML courses. Additionally, the rapid evolution of technology necessitates continuous learning, further bolstering market growth.
Another crucial factor contributing to the market's expansion is the proliferation of online education platforms that offer flexible and affordable ML courses. Platforms like Coursera, Udacity, edX, and Khan Academy have made high-quality education accessible to a global audience. These platforms offer an array of courses tailored to different skill levels, from beginners to advanced learners, making it easier for individuals to pursue continuous learning and career advancement. The convenience and flexibility of online learning are particularly appealing to working professionals and students, thereby driving the market's growth.
The increasing collaboration between educational institutions and technology companies is also playing a pivotal role in the growth of the ML courses market. Many universities and colleges are partnering with leading tech firms to develop specialized curricula that align with industry requirements. These collaborations help ensure that the courses offered are up-to-date with the latest technological advancements and industry standards. As a result, students and professionals are better equipped with the skills needed to thrive in a technology-driven job market, further propelling the demand for ML courses.
On a regional level, North America holds a significant share of the ML courses market, driven by the presence of numerous leading tech companies and educational institutions, as well as a highly skilled workforce. The region's strong emphasis on innovation and technological advancement is a key driver of market growth. Additionally, Asia Pacific is emerging as a lucrative market for ML courses, with countries like China, India, and Japan witnessing increased investments in AI and ML education and training. The rising internet penetration, growing popularity of online education, and government initiatives to promote digital literacy are some of the factors contributing to the market's growth in this region.
Self-Supervised Learning, a cutting-edge approach in the realm of machine learning, is gaining traction as a pivotal element in the development of more autonomous AI systems. Unlike traditional supervised learning, which relies heavily on labeled data, self-supervised learning leverages unlabeled data to train models, significantly reducing the dependency on human intervention for data annotation. This method is particularly advantageous in scenarios where acquiring labeled data is costly or impractical. By enabling models to learn from vast amounts of unlabeled data, self-supervised learning enhances the ability of AI systems to generalize from limited labeled examples, thereby improving their performance in real-world applications. The integration of self-supervised learning techniques into machine learning courses is becoming increasingly important, as it equips learners with the knowledge to tackle complex AI challenges and develop more robust models.
The Machine Learning Courses market is segmented by course type into online courses, offline courses, bootcamps, and workshops. Online courses dominate the segment due to their accessibility, flexibility, and cost-effectiveness. Platforms like Coursera and Udacity have democratized access to high-quality ML education, enabling lear
MULTI-TEMPORAL REMOTE SENSING IMAGE CLASSIFICATION - A MULTI-VIEW APPROACH
Varun Chandola and Ranga Raju Vatsavai
Abstract. Multispectral remote sensing images have been widely used for automated land use and land cover classification tasks. Often thematic classification is done using a single-date image; however, in many instances a single-date image is not informative enough to distinguish between different land cover types. In this paper we show how one can use multiple images, collected at different times of year (for example, during the crop growing season), to learn a better classifier. We propose two approaches, an ensemble-of-classifiers approach and a co-training based approach, and show how both of these methods outperform the straightforward stacked-vector approach often used in multi-temporal image classification. Additionally, the co-training based method addresses the challenge of limited labeled training data in supervised classification, as this classification scheme utilizes a large number of unlabeled samples (which come for free) in conjunction with a small set of labeled training data.
Details on data collection can be found in the manuscript associated with this dataset.
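A toy sketch of the co-training idea described in the abstract above: two classifiers, one per image date ("view"), alternately add their most confident pseudo-labels to a shared labeled pool; this is a simplified illustration on synthetic data, not the paper's exact procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X_view1, X_view2, y_init, rounds=5, per_round=20):
    """y_init holds class labels for labeled pixels and -1 for unlabeled ones."""
    y = y_init.copy()
    for _ in range(rounds):
        for X_view in (X_view1, X_view2):
            labeled = y != -1
            unlabeled = np.where(~labeled)[0]
            if unlabeled.size == 0:
                return y
            clf = LogisticRegression(max_iter=1000).fit(X_view[labeled], y[labeled])
            proba = clf.predict_proba(X_view[unlabeled])
            confident = unlabeled[np.argsort(proba.max(axis=1))[-per_round:]]
            y[confident] = clf.predict(X_view[confident])   # add confident pseudo-labels
    return y

rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(500, 6)), rng.normal(size=(500, 6))   # two synthetic views
y_init = np.full(500, -1)
y_init[:6] = [0, 1, 2, 0, 1, 2]                                 # small labeled seed set
pseudo_labels = co_train(X1, X2, y_init)
```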