72 datasets found

Unlabelled dataset
kaggle.com
Updated Oct 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data Diggers (2023). Unlabelled dataset [Dataset]. https://www.kaggle.com/datasets/ahmedaliraja/unlabelled-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 29, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Data Diggers
Description
This dataset consists of unlabeled data representing various data points collected from different sources and domains. The dataset serves as a blank canvas for unsupervised learning experiments, allowing for the exploration of patterns, clusters, and hidden insights through various data analysis techniques. Researchers and data enthusiasts can use this dataset to develop and test unsupervised learning algorithms, identify underlying structures, and gain a deeper understanding of data without predefined labels.
STL10-Labeled Image Recognition Dataset
kaggle.com
Updated Aug 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Semih Yagli (2025). STL10-Labeled Image Recognition Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/12688697
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/12688697
Dataset updated
Aug 6, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Semih Yagli
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
This public dataset contains labels for the unlabeled 100,000 pictures in the STL-10 dataset.

The dataset is human labeled with AI aid through Etiqueta, the one and only gamified mobile data labeling application. stl10.py is a python script written by Martin Tutek to download the complete STL10 dataset. labels.json contains labels for the 100,000 previously unlabeled images in the STL10 dataset legend.json is a mapping of the labels used. stats.ipynb presents a few statistics regarding the 100,000 newly labeled images.

If you use this dataset in your research please cite the following:

@techreport{yagli2025etiqueta, author = {Semih Yagli}, title = {Etiqueta: AI-Aided, Gamified Data Labeling to Label and Segment Data}, year = {2025}, number = {TR-2025-0001}, address = {NJ, USA}, month = Apr., url = {https://www.aidatalabel.com/technical_reports/aidatalabel_tr_2025_0001.pdf}, institution = {AI Data Label}, } @inproceedings{coates2011analysis, title = {An analysis of single-layer networks in unsupervised feature learning}, author = {Coates, Adam and Ng, Andrew and Lee, Honglak}, booktitle = {Proceedings of the fourteenth international conference on artificial intelligence and statistics}, pages = {215--223}, year = {2011}, organization = {JMLR Workshop and Conference Proceedings} }

Note: The dataset is imported to Kaggle from: https://github.com/semihyagli/STL10-Labeled See also: https://github.com/semihyagli/STL10_Segmentation

If you have comments and questions about Etiqueta or about this dataset, please reach us out at contact@aidatalabel.com
UCI and OpenML Data Sets for Ordinal Quantification
data.europa.eu
data.niaid.nih.gov
+1more
unknown
Updated Jul 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo (2025). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-8177302?locale=nl
Explore at:
unknown(25639502)Available download formats
Dataset updated
Jul 3, 2025
Dataset authored and provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data. With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label. We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample. Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed. Usage You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python. Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments. Data Extraction: In your terminal, you can call either make (recommended), or julia --project="." --eval "using Pkg; Pkg.instantiate()" julia --project="." extract-oq.jl Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class. Further Reading Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
f
Average dice coefficients of the few-supervised learning models using 2%,...
plos.figshare.com
xls
Updated Sep 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Seung-Ah Lee; Hyun Su Kim; Ehwa Yang; Young Cheol Yoon; Ji Hyun Lee; Byung-Ok Choi; Jae-Hun Kim (2024). Average dice coefficients of the few-supervised learning models using 2%, 5%, and 10% of the labeled data, and semi-supervised learning models using 10% of the labeled data for training. [Dataset]. http://doi.org/10.1371/journal.pone.0310203.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0310203.t002
Dataset updated
Sep 6, 2024
Dataset provided by
PLOS ONE
Authors
Seung-Ah Lee; Hyun Su Kim; Ehwa Yang; Young Cheol Yoon; Ji Hyun Lee; Byung-Ok Choi; Jae-Hun Kim
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Average dice coefficients of the few-supervised learning models using 2%, 5%, and 10% of the labeled data, and semi-supervised learning models using 10% of the labeled data for training.
Active Evaluation Software for Selection of Ground Truth Labels
catalog.data.gov
s.cnmilf.com
Updated Jul 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2022). Active Evaluation Software for Selection of Ground Truth Labels [Dataset]. https://catalog.data.gov/dataset/active-evaluation-software-for-selection-of-ground-truth-labels-d0581
Explore at:
Dataset updated
Jul 29, 2022
Dataset provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
This software repository contains a python package Aegis (Active Evaluator Germane Interactive Selector) package that allows us to evaluate machine learning systems's performance (according to a metric such as accuracy) by adaptively sampling trials to label from an unlabeled test set to minimize the number of labels needed. This includes sample (public) data as well as a simulation script that tests different label-selecting strategies on already labelled test sets. This software is configured so that users can add their own data and system outputs to test evaluation.
f
Classification results of the hybrid model on the synthetic data.
plos.figshare.com
bin
Updated Jun 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Moein E. Samadi; Sandra Kiefer; Sebastian Johaness Fritsch; Johannes Bickenbach; Andreas Schuppert (2023). Classification results of the hybrid model on the synthetic data. [Dataset]. http://doi.org/10.1371/journal.pone.0274569.t003
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0274569.t003
Dataset updated
Jun 16, 2023
Dataset provided by
PLOS ONE
Authors
Moein E. Samadi; Sandra Kiefer; Sebastian Johaness Fritsch; Johannes Bickenbach; Andreas Schuppert
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Classification results of the hybrid model on the synthetic data.
H
Replication Data for: Improving Probabilistic Models in Text Classification...
dataverse.harvard.edu
Updated Aug 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mitchell Bosley; Saki Kuzushima; Ted Enamorado; Yuki Shiraito (2024). Replication Data for: Improving Probabilistic Models in Text Classification via Active Learning [Dataset]. http://doi.org/10.7910/DVN/7DOXQY
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/7DOXQY
Dataset updated
Aug 6, 2024
Dataset provided by
Harvard Dataverse
Authors
Mitchell Bosley; Saki Kuzushima; Ted Enamorado; Yuki Shiraito
License
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/7DOXQYhttps://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/7DOXQY
Description
Social scientists often classify text documents to use the resulting labels as an outcome or a predictor in empirical research. Automated text classification has become a standard tool, since it requires less human coding. However, scholars still need many human-labeled documents for training. To reduce labeling costs, we propose a new algorithm for text classification that combines a probabilistic model with active learning. The probabilistic model uses both labeled and unlabeled data, and active learning concentrates labeling efforts on difficult documents to classify. Our validation study shows that with few labeled data the classification performance of our algorithm is comparable to state-of-the-art methods at a fraction of the computational cost. We replicate the results of two published articles with only a small fraction of the original labeled data used in those studies, and provide open-source software to implement our method.
Product Reviews for Ordinal Quantification
zenodo.org
data.europa.eu
zip
Updated Oct 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz (2023). Product Reviews for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8176791
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8176791
Dataset updated
Oct 4, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz
Description
This data set comprises a labeled training set, validation samples, and testing samples for ordinal quantification. The goal of quantification is not to predict the class label of each individual instance, but the distribution of labels in unlabeled sets of data.

The data is extracted from the McAuley data set of product reviews in Amazon, where the goal is to predict the 5-star rating of each textual review. We have sampled this data according to three protocols that are suited for quantification research.

The first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ(50%), is a variant thereof, where only the smoothest 50% of all APP samples are considered. This variant is targeted at ordinal quantification, where classes are ordered and a similarity of neighboring classes can be assumed. 5-star ratings of product reviews lie on an ordinal scale and, hence, pose such an ordinal quantification task. The third protocol considers "real" distributions of labels. These distributions stem from actual products in the original data set.

The data is represented by a RoBERTa embedding. In our experience, logistic regression classifiers work well with this representation.

You can extract our data sets yourself, for instance, if you require a raw textual representation. The original McAuley data set is public already and we provide all of our extraction scripts.

Extraction scripts and experiments: https://github.com/mirkobunse/regularized-oq

Original data by McAuley: https://jmcauley.ucsd.edu/data/amazon/
A
‘BLE RSSI Dataset for Indoor localization’ analyzed by Analyst-2
analyst-2.ai
Updated Nov 21, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘BLE RSSI Dataset for Indoor localization’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-ble-rssi-dataset-for-indoor-localization-f7ec/641e5a0f/?iid=005-634&v=presentation
Explore at:
Dataset updated
Nov 21, 2021
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘BLE RSSI Dataset for Indoor localization’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mehdimka/ble-rssi-dataset on 20 November 2021.

--- Dataset description provided by original source is as follows ---

Content

The dataset was created using the RSSI readings of an array of 13 ibeacons in the first floor of Waldo Library, Western Michigan University. Data was collected using iPhone 6S. The dataset contains two sub-datasets: a labeled dataset (1420 instances) and an unlabeled dataset (5191 instances). The recording was performed during the operational hours of the library. For the labeled dataset, the input data contains the location (label column), a timestamp, followed by RSSI readings of 13 iBeacons. RSSI measurements are negative values. Bigger RSSI values indicate closer proximity to a given iBeacon (e.g., RSSI of -65 represent a closer distance to a given iBeacon compared to RSSI of -85). For out-of-range iBeacons, the RSSI is indicated by -200. The locations related to RSSI readings are combined in one column consisting a letter for the column and a number for the row of the position. The following figure depicts the layout of the iBeacons as well as the arrange of locations.

https://www.kaggle.com/mehdimka/ble-rssi-dataset/downloads/iBeacon_Layout.jpg" alt="iBeacons Layout">

Attribute Information

location: The location of receiving RSSIs from ibeacons b3001 to b3013; symbolic values showing the column and row of the location on the map (e.g., A01 stands for column A, row 1).

date: Datetime in the format of ‘d-m-yyyy hh:mm:ss’

b3001 - b3013: RSSI readings corresponding to the iBeacons; numeric, integers only.

Acknowledgements

Provider: Mehdi Mohammadi and Ala Al-Fuqaha, {mehdi.mohammadi, ala-alfuqaha}@wmich.edu, Department of Computer Science, Western Michigan University

Citation Request:

M. Mohammadi, A. Al-Fuqaha, M. Guizani, J. Oh, “Semi-supervised Deep Reinforcement Learning in Support of IoT and Smart City Services,” IEEE Internet of Things Journal, Vol. PP, No. 99, 2017.

Inspiration

How unlabeled data can help for an improved learning system. How a GAN model can synthesizes viable paths based on the little labeled data and larger set of unlabeled data.

--- Original source retains full ownership of the source dataset ---
Data from: Trust and Believe – Should We? Evaluating the Trustworthiness of...
zenodo.org
bin, csv +2
Updated Aug 22, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tanveer Khan; Tanveer Khan (2022). Trust and Believe – Should We? Evaluating the Trustworthiness of Twitter Users [Dataset]. http://doi.org/10.5281/zenodo.6964059
Explore at:
text/x-python, txt, csv, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6964059
Dataset updated
Aug 22, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Tanveer Khan; Tanveer Khan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Trust and Believe – Should We? Evaluating the Trustworthiness of Twitter Users

This model is used to analyze the Twitter users and assigns a score calculated based on their social profiles, the credibility of his tweets, the h-indexing score of the tweets. Users with a higher score are not only considered as more influential but also their tweets are considered to have greater credibility. The model is based on both the user level and content level features of a Twitter user. The details for feature extraction and calculating the Influence score is given in the paper.

Description
To extract the features from Twitter and generate the dataset we used Python. A modAL framework is used to randomly selects ambiguous data points from the unlabeled data pool using three different sampling techniques and the human manually annotates the selected data. We generate a dataset for 50000 Twitter users and then used different classifiers to classify the Twitter user either as Trusted or Untrusted.

Organization
The project consists of the following files:

Dataset.csv
The dataset consists of different features of 50000 Twitter users (Politicians) without labels.

Manually_labeled-Dataset.csv
This CSV file contains all those Twitter users classified manually as Trusted or Untrusted

feature_extraction.py
This python script is used to calculate the Influence score of a Twitter user and further used to generate a dataset. The Influence score is based on:

- Social reputation of the user
- Content score of the tweets
- Tweets credibility
- Index score for the number of re-tweets and likes

Activelearner.ipynb
To classify a large pool of unlabeled data, we used an active learning model (ModAL Framework). A semi-supervised learning algorithm ideal for a situation in which the unlabeled data is abundant but manual labeling is expensive. The active learner randomly selects ambiguous data points from the unlabeled data pool using three different sampling techniques and the human manually annotates the selected data. Further, we use four different classifiers (Support Vector Machine, Logistic Regression, Multilayer Perceptron and Random Forest) to classify the Twitter user as either Trusted Or Untrusted.

twitter_reputation.ipynb
We used different regression models to test its performance on our generated dataset (It is only for testing, now no more part of our work). We train and evaluate our models using different regression models.
Training and testing three regression models:
1. Multilayer perceptron
2. Deep neural network
3. Linear regression

twitter_credentials.py
In order to extract the features of Twitter users first, one need to authenticate by providing the credentials given in this file.

Screen names (Screen_name_1.txt, Screen_name_2.txt, Screen_name_3.txt)
These text files consist of all the Twitter user screen_names. All of them are politicians. We remove the names of all those politicians whose accounts are private. In addition, all those politicians who have no followers/followings are not on the list are also removed. The text of the tweets are not saved. Furthermore, we also remove duplicate names.

References
[1] https://stackoverflow.com/questions/38881314/twitter-data-to-csv-getting-error-when-trying-to-add-to-csv-file

[2] https://stackoverflow.com/questions/48157259/python-tweepy-api-user-timeline-for-list-of-multiple-users-error

[3] https://gallery.azure.ai/Notebook/Computing-Influence-Score-for-Twitter-Users-1

[4] https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

[5] https://towardsdatascience.com/deep-neural-networks-for-regression-problems-81321897ca33
Data used in Machine learning reveals the waggle drift's role in the honey...
data.europa.eu
zenodo.org
unknown
Updated Jul 4, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo (2025). Data used in Machine learning reveals the waggle drift's role in the honey bee dance communication system [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7928121?locale=ga
Explore at:
unknown(3662)Available download formats
Dataset updated
Jul 4, 2025
Dataset authored and provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data and metadata used in "Machine learning reveals the waggle drift’s role in the honey bee dance communication system" All timestamps are given in ISO 8601 format. The following files are included: Berlin2019_waggle_phases.csv, Berlin2021_waggle_phases.csv Automatic individual detections of waggle phases during our recording periods in 2019 and 2021. timestamp: Date and time of the detection. cam_id: Camera ID (0: left side of the hive, 1: right side of the hive). x_median, y_median: Median position of the bee during the waggle phase (for 2019 given in millimeters after applying a homography, for 2021 in the original image coordinates). waggle_angle: Body orientation of the bee during the waggle phase in radians (0: oriented to the right, PI / 4: oriented upwards). Berlin2019_dances.csv Automatic detections of dance behavior during our recording period in 2019. dancer_id: Unique ID of the individual bee. dance_id: Unique ID of the dance. ts_from, ts_to: Date and time of the beginning and end of the dance. cam_id: Camera ID (0: left side of the hive, 1: right side of the hive). median_x, median_y: Median position of the individual during the dance. feeder_cam_id: ID of the feeder that the bee was detected at prior to the dance. Berlin2019_followers.csv Automatic detections of attendance and following behavior, corresponding to the dances in Berlin2019_dances.csv. dance_id: Unique ID of the dance being attended or followed. follower_id: Unique ID of the individual attending or following the dance. ts_from, ts_to: Date and time of the beginning and end of the interaction. label: “attendance” or “follower” cam_id: Camera ID (0: left side of the hive, 1: right side of the hive). Berlin2019_dances_with_manually_verified_times.csv A sample of dances from Berlin2019_dances.csv where the exact timestamps have been manually verified to correspond to the beginning of the first and last waggle phase down to a precision of ca. 166 ms (video material was recorded at 6 FPS). dance_id: Unique ID of the dance. dancer_id: Unique ID of the dancing individual. cam_id: Camera ID (0: left side of the hive, 1: right side of the hive). feeder_cam_id: ID of the feeder that the bee was detected at prior to the dance. dance_start, dance_end: Manually verified date and times of the beginning and end of the dance. Berlin2019_dance_classifier_labels.csv Manually annotated waggle phases or following behavior for our recording season in 2019 that was used to train the dancing and following classifier. Can be merged with the supplied individual detections. timestamp: Timestamp of the individual frame the behavior was observed in. frame_id: Unique ID of the video frame the behavior was observed in. bee_id: Unique ID of the individual bee. label: One of “nothing”, “waggle”, “follower” Berlin2019_dance_classifier_unlabeled.csv Additional unlabeled samples of timestamp and individual ID with the same format as Berlin2019_dance_classifier_labels.csv, but without a label. The data points have been sampled close to detections of our waggle phase classifier, so behaviors related to the waggle dance are likely overrepresented in that sample. Berlin2021_waggle_phase_classifier_labels.csv Manually annotated detections of our waggle phase detector (bb_wdd2) that were used to train the neural network filter (bb_wdd_filter) for the 2021 data. detection_id: Unique ID of the waggle phase. label: One of “waggle”, “activating”, “ventilating”, “trembling”, “other”. Where “waggle” denoted a waggle phase, “activating” is the shaking signal, “ventilating” is a bee fanning her wings. “trembling” denotes a tremble dance, but the distinction from the “other” class was often not clear, so “trembling” was merged into “other” for training. orientation: The body orientation of the bee that triggered the detection in radians (0: facing to the right, PI /4: facing up). metadata_path: Path to the individual detection in the same directory structure as created by the waggle dance detector. Berlin2021_waggle_phase_classifier_ground_truth.zip The output of the waggle dance detector (bb_wdd2) that corresponds to Berlin2021_waggle_phase_classifier_labels.csv and is used for training. The archive includes a directory structure as output by the bb_wdd2 and each directory includes the original image sequence that triggered the detection in an archive and the corresponding metadata. The training code supplied in bb_wdd_filter directly works with this directory structure. Berlin2019_tracks.zip Detections and tracks from the recording season in 2019 as produced by our tracking system. As the full data is several terabytes in size, we include the subset of our data here that is relevant for our publication which comprises over 46 million detections. We included tracks for all detected behaviors (dancing, following, attending) including one minute before and after the behavior. We also included all tracks that correspond to the labeled and unlabeled data that was used to train the
Amos: A large-scale abdominal multi-organ benchmark for versatile medical...
zenodo.org
csv, zip
Updated Nov 7, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
YuanfengJi; YuanfengJi (2022). Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation (Unlabeled Data Part I) [Dataset]. http://doi.org/10.5281/zenodo.7262757
Explore at:
zip, csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7262757
Dataset updated
Nov 7, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
YuanfengJi; YuanfengJi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constraint by the high cost of collecting and labeling 3D medical data, most of the deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate the limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of the existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. The paper can be found at https://arxiv.org/pdf/2206.08023.pdf

In addition to providing the labeled 600 CT and MRI scans, we expect to provide 2000 CT and 1200 MRI scans without labels to support more learning tasks (semi-supervised, un-supervised, domain adaption, ...). The link can be found in:

labeled data (500CT+100MRI)

unlabeled data Part I (900CT)

unlabeled data Part II (1100CT) (Now there are 1000CT, we will replenish to 1100CT)

unlabeled data Part III (1200MRI)

if you found this dataset useful for your research, please cite:

@article{ji2022amos, title={AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation}, author={Ji, Yuanfeng and Bai, Haotian and Yang, Jie and Ge, Chongjian and Zhu, Ye and Zhang, Ruimao and Li, Zhen and Zhang, Lingyan and Ma, Wanling and Wan, Xiang and others}, journal={arXiv preprint arXiv:2206.08023}, year={2022} }
Dataset: Data-Driven Machine Learning-Informed Framework for Model...
zenodo.org
csv
Updated May 12, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Edgar Amalyan; Edgar Amalyan (2025). Dataset: Data-Driven Machine Learning-Informed Framework for Model Predictive Control in Vehicles [Dataset]. http://doi.org/10.5281/zenodo.15288740
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.15288740
Dataset updated
May 12, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Edgar Amalyan; Edgar Amalyan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset belonging to the paper: Data-Driven Machine Learning-Informed Framework for Model Predictive Control in Vehicles

labeled_seed.csv: Processed and labeled data of all maneuvers combined into a single file, sorted by label

raw_track_session.csv: Untouched CSV file from Racebox track session

unlabeled_exemplar.csv: Processed but unlabeled data of street and track data
h
510app_dataset
huggingface.co
Updated Dec 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ryan Smith (2024). 510app_dataset [Dataset]. https://huggingface.co/datasets/RyanS974/510app_dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 10, 2024
Authors
Ryan Smith
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for Job Fair Candidates Classification Dataset

A supervised learning dataset for multi-label classification in tech industry hiring, focusing on candidate evaluation and salary prediction.

Dataset Details Dataset Description

A specialized dataset created for supervised learning tasks in hiring prediction. The dataset contains candidate information with 7 features and 2 classification labels, derived from a larger unlabeled dataset. This dataset… See the full description on the dataset page: https://huggingface.co/datasets/RyanS974/510app_dataset.
R
AI in Semi-supervised Learning Market Market Research Report 2033
researchintelo.com
csv, pdf, pptx
Updated Jul 24, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Research Intelo (2025). AI in Semi-supervised Learning Market Market Research Report 2033 [Dataset]. https://researchintelo.com/report/ai-in-semi-supervised-learning-market-market
Explore at:
csv, pptx, pdfAvailable download formats
Dataset updated
Jul 24, 2025
Dataset authored and provided by
Research Intelo
License
https://researchintelo.com/privacy-and-policyhttps://researchintelo.com/privacy-and-policy
Time period covered
2024 - 2033
Area covered
Global
Description
AI in Semi-supervised Learning Market Outlook

According to our latest research, the AI in Semi-supervised Learning market size reached USD 1.82 billion in 2024 globally, driven by rapid advancements in artificial intelligence and machine learning applications across diverse industries. The market is expected to expand at a robust CAGR of 28.1% from 2025 to 2033, reaching a projected value of USD 17.17 billion by 2033. This exponential growth is primarily fueled by the increasing need for efficient data labeling, the proliferation of unstructured data, and the growing adoption of AI-driven solutions in both large enterprises and small and medium businesses. As per the latest research, the surging demand for automation, accuracy, and cost-efficiency in data processing is significantly accelerating the adoption of semi-supervised learning models worldwide.

One of the most significant growth factors for the AI in Semi-supervised Learning market is the explosive increase in data generation across industries such as healthcare, finance, retail, and automotive. Organizations are continually collecting vast amounts of structured and unstructured data, but the process of labeling this data for supervised learning remains time-consuming and expensive. Semi-supervised learning offers a compelling solution by leveraging small amounts of labeled data alongside large volumes of unlabeled data, thus reducing the dependency on extensive manual annotation. This approach not only accelerates the deployment of AI models but also enhances their accuracy and scalability, making it highly attractive for enterprises seeking to maximize the value of their data assets while minimizing operational costs.

Another critical driver propelling the growth of the AI in Semi-supervised Learning market is the increasing sophistication of AI algorithms and the integration of advanced technologies such as deep learning, natural language processing, and computer vision. These advancements have enabled semi-supervised learning models to achieve remarkable performance in complex tasks like image and speech recognition, medical diagnostics, and fraud detection. The ability to process and interpret vast datasets with minimal supervision is particularly valuable in sectors where labeled data is scarce or expensive to obtain. Furthermore, the ongoing investments in research and development by leading technology companies and academic institutions are fostering innovation, resulting in more robust and scalable semi-supervised learning frameworks that can be seamlessly integrated into enterprise workflows.

The proliferation of cloud computing and the increasing adoption of hybrid and multi-cloud environments are also contributing significantly to the expansion of the AI in Semi-supervised Learning market. Cloud-based deployment offers unparalleled scalability, flexibility, and cost-efficiency, allowing organizations of all sizes to access cutting-edge AI tools and infrastructure without the need for substantial upfront investments. This democratization of AI technology is empowering small and medium enterprises to leverage semi-supervised learning for competitive advantage, driving widespread adoption across regions and industries. Additionally, the emergence of AI-as-a-Service (AIaaS) platforms is further simplifying the integration and management of semi-supervised learning models, enabling businesses to accelerate their digital transformation initiatives and unlock new growth opportunities.

From a regional perspective, North America currently dominates the AI in Semi-supervised Learning market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The strong presence of leading AI vendors, robust technological infrastructure, and high investments in AI research and development are key factors driving market growth in these regions. Asia Pacific is expected to witness the fastest CAGR during the forecast period, fueled by rapid digitalization, expanding IT infrastructure, and increasing government initiatives to promote AI adoption. Meanwhile, Latin America and the Middle East & Africa are also showing promising growth potential, supported by rising awareness of AI benefits and growing investments in digital transformation projects across various sectors.

Component Analysis

The component segment of the AI in Semi-supervised Learning market is divided into software, hardware, and services, each playing a pivotal role in the adoption and implementation of semi-s
f
The comparison of the median of the binary classification measurement...
plos.figshare.com
bin
Updated Jun 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Moein E. Samadi; Sandra Kiefer; Sebastian Johaness Fritsch; Johannes Bickenbach; Andreas Schuppert (2023). The comparison of the median of the binary classification measurement results on the synthetic data. [Dataset]. http://doi.org/10.1371/journal.pone.0274569.t004
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0274569.t004
Dataset updated
Jun 13, 2023
Dataset provided by
PLOS ONE
Authors
Moein E. Samadi; Sandra Kiefer; Sebastian Johaness Fritsch; Johannes Bickenbach; Andreas Schuppert
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The comparison of the median of the binary classification measurement results on the synthetic data.
Data from: The COUGHVID crowdsourcing dataset: A corpus for the study of...
zenodo.org
ekoizpen-zientifikoa.ehu.eus
zip
Updated Aug 28, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lara Orlandic; Lara Orlandic; Tomas Teijeiro; Tomas Teijeiro; David Atienza; David Atienza (2022). The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms [Dataset]. http://doi.org/10.5281/zenodo.7024894
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7024894
Dataset updated
Aug 28, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Lara Orlandic; Lara Orlandic; Tomas Teijeiro; Tomas Teijeiro; David Atienza; David Atienza
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Overview

Cough audio signal classification has been successfully used to diagnose a variety of respiratory conditions, and there has been significant interest in leveraging Machine Learning (ML) to provide widespread COVID-19 screening. The COUGHVID dataset provides over 30,000 crowdsourced cough recordings representing a wide range of subject ages, genders, geographic locations, and COVID-19 statuses. Furthermore, experienced pulmonologists labeled more than 2,000 recordings to diagnose medical abnormalities present in the coughs, thereby contributing one of the largest expert-labeled cough datasets in existence that can be used for a plethora of cough audio classification tasks. As a result, the COUGHVID dataset contributes a wealth of cough recordings for training ML models to address the world’s most urgent health crises.

Private Set and Testing Protocol

Researchers interested in testing their models on the private test dataset should contact us at coughvid@epfl.ch, briefly explaining the type of validation they wish to make, and their obtained results obtained through cross-validation with the public data. Then, access to the unlabeled recordings will be provided, and the researchers should send the predictions of their models on these recordings. Finally, the performance metrics of the predictions will be sent to the researchers. The private testing data is not included in any file within our Zenodo record, and it can only be accessed by contacting the COUGHVID team at the aforementioned e-mail address.

New Semi-Supervised Labeling

The third version of the COUGHVID dataset contains thousands of additional recordings obtained through October 2021. Additionally, the recordings containing coughs were re-labeled according to a semi-supervised learning algorithm that combined the user labels with those of the expert physicians, which were modeled using ML and expanded on the previously unlabeled data. These labels can be found in the "status_SSL" column of the "metadata_compiled.csv" file.
d
Data and code from: Learning a deep language model for microbiomes: The...
dataone.org
search.dataone.org
+2more
Updated Aug 1, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Quintin Pope; Rohan Varma; Christine Tataru; Maude David; Xiaoli Fern (2025). Data and code from: Learning a deep language model for microbiomes: The power of large scale unlabeled microbiome data [Dataset]. http://doi.org/10.5061/dryad.tb2rbp08p
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.tb2rbp08p
Dataset updated
Aug 1, 2025
Dataset provided by
Dryad Digital Repository
Authors
Quintin Pope; Rohan Varma; Christine Tataru; Maude David; Xiaoli Fern
Description
We use open source human gut microbiome data to learn a microbial â€œlanguageâ€ model by adapting techniques from Natural Language Processing (NLP). Our microbial â€œlanguageâ€ model is trained in a self-supervised fashion (i.e., without additional external labels) to capture the interactions among different microbial species and the common compositional patterns in microbial communities. The learned model produces contextualized taxa representations that allow a single bacteria species to be represented differently according to the specific microbial environment it appears in. The model further provides a sample representation by collectively interpreting different bacteria species in the sample and their interactions as a whole. We show that, compared to baseline representations, our sample representation consistently leads to improved performance for multiple prediction tasks including predicting Irritable Bowel Disease (IBD) and diet patterns. Coupled with a simple ensemble strategy, it p..., No additional raw data was collected for this project. All inputs are available publicly. American Gut Project, Halfvarson, and Schirmer raw data are available from the NCBI database (accession numbers PRJEB11419, PRJEB18471, and PRJNA398089, respectively).Â We used the curated data produced by Tataru and David, 2020., , # Code and data for "Learning a deep language model for microbiomes: the power of large scale unlabeled microbiome data"

Data:

vocab_embeddings.npy

Fixed vocabulary embeddings produced from prior work: Decoding the language of microbiomes using word-embedding techniques, and applications in inflammatory bowel disease. Adapted from here.

microbiomedata.zip

Contains the labels and data for the three datasets used in this study. Specifically, it includes:

IBD_(test|train)*(512|otu).npy and IBD*(test|train)_labels.npy

halfvarson_(512_otu|otu).npy and halfvarson_IBD_labels.npy

schirmer_IBD_(512_otu|otu).npy and schirmer_IBD_labels.npy

(test|train)encodings_(512|1897).npy

The data are stored as n_samples x max_sample_size x 2 numpy arrays, containin...
R
AI in Unsupervised Learning Market Market Research Report 2033
researchintelo.com
csv, pdf, pptx
Updated Jul 24, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Research Intelo (2025). AI in Unsupervised Learning Market Market Research Report 2033 [Dataset]. https://researchintelo.com/report/ai-in-unsupervised-learning-market-market
Explore at:
pdf, csv, pptxAvailable download formats
Dataset updated
Jul 24, 2025
Dataset authored and provided by
Research Intelo
License
https://researchintelo.com/privacy-and-policyhttps://researchintelo.com/privacy-and-policy
Time period covered
2024 - 2033
Area covered
Global
Description
AI in Unsupervised Learning Market Outlook

According to our latest research, the AI in Unsupervised Learning market size reached USD 3.8 billion globally in 2024, demonstrating robust expansion as organizations increasingly leverage unsupervised techniques for extracting actionable insights from unlabelled data. The market is forecasted to grow at a CAGR of 28.2% from 2025 to 2033, propelling the industry to an estimated USD 36.7 billion by 2033. This remarkable growth trajectory is primarily fueled by the escalating adoption of artificial intelligence across diverse sectors, an exponential surge in data generation, and the pressing need for advanced analytics that can operate without manual data labeling.

One of the key growth factors driving the AI in Unsupervised Learning market is the rising complexity and volume of data generated by enterprises in the digital era. Organizations are inundated with unstructured and unlabelled data from sources such as social media, IoT devices, and transactional systems. Traditional supervised learning methods are often impractical due to the time and cost associated with manual labeling. Unsupervised learning algorithms, such as clustering and dimensionality reduction, offer a scalable solution by autonomously identifying patterns, anomalies, and hidden structures within vast datasets. This capability is increasingly vital for industries aiming to enhance decision-making, streamline operations, and gain a competitive edge through advanced analytics.

Another significant driver is the rapid advancement in computational power and AI infrastructure, which has made it feasible to implement sophisticated unsupervised learning models at scale. The proliferation of cloud computing and specialized AI hardware has reduced barriers to entry, enabling even small and medium enterprises to deploy unsupervised learning solutions. Additionally, the evolution of neural networks and deep learning architectures has expanded the scope of unsupervised algorithms, allowing for more complex tasks such as image recognition, natural language processing, and anomaly detection. These technological advancements are not only accelerating adoption but also fostering innovation across sectors including healthcare, finance, manufacturing, and retail.

Furthermore, regulatory compliance and the growing emphasis on data privacy are pushing organizations to adopt unsupervised learning methods. Unlike supervised approaches that require sensitive data labeling, unsupervised algorithms can process data without explicit human intervention, thereby reducing the risk of privacy breaches. This is particularly relevant in sectors such as healthcare and BFSI, where stringent data protection regulations are in place. The ability to derive insights from unlabelled data while maintaining compliance is a compelling value proposition, further propelling the market forward.

Regionally, North America continues to dominate the AI in Unsupervised Learning market owing to its advanced technological ecosystem, significant investments in AI research, and strong presence of leading market players. Europe follows closely, driven by robust regulatory frameworks and a focus on ethical AI deployment. The Asia Pacific region is exhibiting the fastest growth, fueled by rapid digital transformation, government initiatives, and increasing adoption of AI across industries. Latin America and the Middle East & Africa are also witnessing steady growth, albeit at a slower pace, as awareness and infrastructure continue to develop.

Component Analysis

The Component segment of the AI in Unsupervised Learning market is categorized into Software, Hardware, and Services, each playing a pivotal role in the overall ecosystem. The software segment, comprising machine learning frameworks, data analytics platforms, and AI development tools, holds the largest market share. This dominance is attributed to the continuous evolution of AI algorithms and the increasing availability of open-source and proprietary solutions tailored for unsupervised learning. Enterprises are investing heavily in software that can facilitate the seamless integration of unsupervised learning capabilities into existing workflows, enabling automation, predictive analytics, and pattern recognition without the need for labeled data.

The hardware segment, while smaller in comparison to software, is experiencing significant growth due to the escalating demand for high-perf
E
Hackathon - TF-TG literature triage unlabelled data
live.european-language-grid.eu
tsv
Updated Mar 28, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Hackathon - TF-TG literature triage unlabelled data [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7457
Explore at:
tsvAvailable download formats
Dataset updated
Mar 28, 2024
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Once literature triage system is ready it is time to actually try to apply if to records that do not have any label in order to find the subset that does describe TF-TG interactions (are relevant). This is the corpus that has to be labeled by the systems created (hopefully) during the hackathon. To make the results more useful we have pre-selected records that do mention TFs by exploiting either automatic human TF mention recognition or external references from databases that have manually curated information on transcription factors (from GeneRif or UniProt). This means that these abstracts should be enriched with TF relevant records. This record has the same format as the training data except that the last column with the class label is missing.It contains PMIDs and Abstracts.Name: greekc_triage_unlabelled_v01.tsvExample:Format: tsv-separated columns (PMID, PubAnnotation JSON formated results of Pubtator for this record together with the automatically detected gene mentions using GnormPlus providing the Entrez Gene Identifiers together with the mention offsets, i.e. start and end character positionsPubAnnotation format description: http://www.pubannotation.org/docs/annotation-format/PubTator record retrieval description:https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/curl.htmlWarning: This file is quite big!

Facebook

Twitter

Click to copy link

Link copied

Cite

Data Diggers (2023). Unlabelled dataset [Dataset]. https://www.kaggle.com/datasets/ahmedaliraja/unlabelled-dataset

Unlabelled dataset

Unlabeled Dataset: Exploring Uncharted Data Territories

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Oct 29, 2023

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Data Diggers

Description

This dataset consists of unlabeled data representing various data points collected from different sources and domains. The dataset serves as a blank canvas for unsupervised learning experiments, allowing for the exploration of patterns, clusters, and hidden insights through various data analysis techniques. Researchers and data enthusiasts can use this dataset to develop and test unsupervised learning algorithms, identify underlying structures, and gain a deeper understanding of data without predefined labels.

Clear search

Close search

Google apps

Main menu

Unlabelled dataset

STL10-Labeled Image Recognition Dataset

UCI and OpenML Data Sets for Ordinal Quantification

Average dice coefficients of the few-supervised learning models using 2%,...

Active Evaluation Software for Selection of Ground Truth Labels

Classification results of the hybrid model on the synthetic data.

Replication Data for: Improving Probabilistic Models in Text Classification...

Product Reviews for Ordinal Quantification

‘BLE RSSI Dataset for Indoor localization’ analyzed by Analyst-2

Content

Attribute Information

Acknowledgements

Inspiration

How unlabeled data can help for an improved learning system. How a GAN model can synthesizes viable paths based on the little labeled data and larger set of unlabeled data.

Data from: Trust and Believe – Should We? Evaluating the Trustworthiness of...

Data used in Machine learning reveals the waggle drift's role in the honey...

Amos: A large-scale abdominal multi-organ benchmark for versatile medical...

Dataset: Data-Driven Machine Learning-Informed Framework for Model...

510app_dataset

AI in Semi-supervised Learning Market Market Research Report 2033

AI in Semi-supervised Learning Market Outlook

Component Analysis

The comparison of the median of the binary classification measurement...

Data from: The COUGHVID crowdsourcing dataset: A corpus for the study of...

Data and code from: Learning a deep language model for microbiomes: The...

Data:

AI in Unsupervised Learning Market Market Research Report 2033

AI in Unsupervised Learning Market Outlook

Component Analysis

Hackathon - TF-TG literature triage unlabelled data

Unlabelled dataset

Unlabeled Dataset: Exploring Uncharted Data Territories