This dataset consists of unlabeled data representing various data points collected from different sources and domains. The dataset serves as a blank canvas for unsupervised learning experiments, allowing for the exploration of patterns, clusters, and hidden insights through various data analysis techniques. Researchers and data enthusiasts can use this dataset to develop and test unsupervised learning algorithms, identify underlying structures, and gain a deeper understanding of data without predefined labels.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
## Overview
Unlabeled is a dataset for object detection tasks - it contains Face annotations for 2,928 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
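For a programmatic download, a minimal sketch using the `roboflow` Python package is below; the API key and the workspace/project slugs are placeholders, not this dataset's actual identifiers.

```python
# Minimal sketch: download a Roboflow dataset export with the `roboflow` package.
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")  # placeholder API key
project = rf.workspace("your-workspace").project("your-project")  # placeholder slugs
dataset = project.version(1).download("coco")  # choose an export format, e.g. COCO
print(dataset.location)  # local path of the downloaded dataset
```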
## License
This dataset is available under the [Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/).
These datasets were used while writing the following work:
Polo, F. M., Ciochetti, I., and Bertolo, E. (2021). Predicting legal proceedings status: approaches based on sequential text data. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, pages 264–265.
Please cite us if you use our datasets in your academic work:
@inproceedings{polo2021predicting,
title={Predicting legal proceedings status: approaches based on sequential text data},
author={Polo, Felipe Maia and Ciochetti, Itamar and Bertolo, Emerson},
booktitle={Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law},
pages={264--265},
year={2021}
}
More details below!
Every legal proceeding in Brazil is in one of three possible status classes: (i) archived, (ii) active, or (iii) suspended. The status holds at a specific instant in time and may be temporary or permanent. Moreover, statuses are decided by the courts to organize their workflow, which in Brazil may reach thousands of simultaneous cases per judge. Developing machine learning models to classify legal proceedings according to their status can assist public and private institutions in managing large portfolios of legal proceedings, providing gains in scale and efficiency.
In this dataset, each proceeding is made up of a sequence of short texts called “motions” written in Portuguese by the courts’ administrative staff. The motions relate to the proceedings, but not necessarily to their legal status.
Our data is composed of two datasets: a dataset of ~3 million unlabeled motions and a dataset of 6,449 legal proceedings, each with its own variable number of motions, that have been labeled by lawyers. Among the labeled data, 47.14% is classified as archived (class 1), 45.23% as active (class 2), and 7.63% as suspended (class 3).
The datasets we use are representative samples from the first (São Paulo) and third (Rio de Janeiro) most significant state courts. State courts handle the most variable types of cases throughout Brazil and are responsible for 80% of the total amount of lawsuits. Therefore, these datasets are a good representation of a very significant portion of the use of language and expressions in Brazilian legal vocabulary.
Regarding the labels dataset, the key "-1" denotes the most recent text, "-2" the second most recent, and so on.
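To illustrate the key convention, here is a minimal sketch of reconstructing the chronological order of motions; the mapping structure and motion texts are hypothetical stand-ins, not the actual file format.

```python
# Hypothetical example: keys "-1", "-2", ... index recency ("-1" = most recent).
proceeding = {"-1": "latest motion", "-2": "previous motion", "-3": "oldest motion"}

# Sorting the keys numerically puts the oldest motion first ("-3" < "-2" < "-1"),
# yielding the motions in chronological order.
chronological = [proceeding[k] for k in sorted(proceeding, key=int)]
print(chronological)  # ['oldest motion', 'previous motion', 'latest motion']
```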
We would like to thank Ana Carolina Domingues Borges, Andrews Adriani Angeli, and Nathália Caroline Juarez Delgado from Tikal Tech for helping us to obtain the datasets. This work would not be possible without their efforts.
Can you develop good machine learning classifiers for text sequences? :)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data related to the experiment conducted in the paper "Towards the Systematic Testing of Virtual Reality Programs".
It contains an implementation of an approach for predicting defect proneness on unlabeled datasets: Average Clustering and Labeling (ACL).
ACL models achieve good prediction performance, comparable to typical supervised learning models in terms of F-measure, and offer a viable choice for defect prediction on unlabeled datasets.
This dataset also contains analyses related to code smells in C# repositories. Please check the paper for further information.
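As a rough illustration of the ACL idea (cluster unlabeled instances, then label whole clusters by comparing their average metric values to the global averages), here is a minimal sketch; the cluster count, synthetic metrics, and majority rule are assumptions, not the paper's exact procedure.

```python
# Minimal ACL-style sketch: cluster unlabeled modules, then flag clusters
# whose mean metric values mostly exceed the global means as defect-prone.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 10)  # stand-in for per-module static code metrics
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

global_mean = X.mean(axis=0)
labels = np.zeros(len(X), dtype=int)
for c in range(kmeans.n_clusters):
    members = kmeans.labels_ == c
    # Heuristic stand-in for the labeling rule: defect-prone if most of the
    # cluster's mean metric values exceed the global means.
    if (X[members].mean(axis=0) > global_mean).sum() > X.shape[1] / 2:
        labels[members] = 1
print(labels.sum(), "instances flagged as defect-prone")
```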
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Average Dice coefficients of the few-shot supervised learning models using 2%, 5%, and 10% of the labeled data, and of the semi-supervised learning models using 10% of the labeled data for training.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning techniques that rely on textual features or sentiment lexicons can produce erroneous sentiment analysis. These techniques are particularly vulnerable to domain-related difficulties, especially when dealing with big data. In addition, labeling is time-consuming, and supervised machine learning algorithms often lack labeled data. Transfer learning can help save time and achieve high performance with smaller datasets in this field. To address this, we used a transfer learning-based Multi-Domain Sentiment Classification (MDSC) technique. We identify the sentiment polarity of text in an unlabeled target domain by learning from reviews in a labeled source domain. This research aims to evaluate the impact of domain adaptation and measure the extent to which transfer learning enhances sentiment analysis outcomes. We employed the transfer learning models BERT, RoBERTa, ELECTRA, and ULMFiT to improve performance in sentiment analysis. We analyzed sentiment with various transformer models and compared the performance of LSTM and CNN. The experiments were carried out on five publicly available sentiment analysis datasets, namely Hotel Reviews (HR), Movie Reviews (MR), Sentiment140 Tweets (ST), Citation Sentiment Corpus (CSC), and Bioinformatics Citation Corpus (BCC), to adapt multiple target domains. The performance of the models, employing transfer learning across diverse datasets, demonstrates how various factors influence the outputs.
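As an illustration of the transfer setup described above (fine-tune on a labeled source domain, predict on an unlabeled target domain), here is a minimal sketch with Hugging Face Transformers; the model choice and example texts are assumptions, not the paper's exact configuration.

```python
# Minimal cross-domain sentiment sketch with a pretrained transformer.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
# (Fine-tuning on the labeled source domain, e.g. with transformers.Trainer,
# is omitted for brevity.)

target_texts = ["the cited method performs poorly"]  # unlabeled target domain
inputs = tokenizer(target_texts, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # sentiment class probabilities
```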
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We study a class of general M-estimators in the semi-supervised setting, wherein the data are typically a combination of a relatively small labeled dataset and large amounts of unlabeled data. A new estimator, which efficiently uses the useful information contained in the unlabeled data, is proposed via a projection technique. We prove consistency and asymptotic normality, and provide an inference procedure based on K-fold cross-validation. The optimal weights are derived to balance the contributions of the labeled and unlabeled data. It is shown that the proposed method, by taking advantage of the unlabeled data, produces asymptotically more efficient estimation of the target parameters than the supervised counterpart. Supportive numerical evidence is shown in simulation studies. Applications are illustrated in analysis of the homeless data in Los Angeles. Supplementary materials for this article are available online.
GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0-standalone.html
Contains datasets for training and testing models for Urban Seismic Event Detection (USED).
The data is in SAC format, with JSON labels. The obspy library in Python can be used to read this data.
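For example, a minimal loading sketch with obspy; the file names below are placeholders, and the label schema is assumed to be one JSON file per waveform.

```python
# Minimal sketch: read one SAC waveform and its (assumed) JSON label file.
import json
from obspy import read

stream = read("event_0001.SAC")  # hypothetical file name
trace = stream[0]
print(trace.stats.sampling_rate, trace.data.shape)

with open("event_0001.json") as f:  # hypothetical label file name
    label = json.load(f)
print(label)
```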
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘BLE RSSI Dataset for Indoor localization’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mehdimka/ble-rssi-dataset on 20 November 2021.
--- Dataset description provided by original source is as follows ---
The dataset was created using the RSSI readings of an array of 13 iBeacons on the first floor of Waldo Library, Western Michigan University. Data was collected using an iPhone 6S. The dataset contains two sub-datasets: a labeled dataset (1,420 instances) and an unlabeled dataset (5,191 instances). The recording was performed during the operational hours of the library. For the labeled dataset, the input data contains the location (label column) and a timestamp, followed by the RSSI readings of the 13 iBeacons. RSSI measurements are negative values; larger RSSI values indicate closer proximity to a given iBeacon (e.g., an RSSI of -65 represents a closer distance than an RSSI of -85). For out-of-range iBeacons, the RSSI is indicated by -200. The locations related to the RSSI readings are combined in one column consisting of a letter for the column and a number for the row of the position. The following figure depicts the layout of the iBeacons as well as the arrangement of locations.
iBeacons layout: https://www.kaggle.com/mehdimka/ble-rssi-dataset/downloads/iBeacon_Layout.jpg
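A minimal loading sketch is below; the CSV file name and the beacon column naming are assumptions based on the description above.

```python
# Minimal sketch: load the labeled sub-dataset and mask out-of-range beacons.
import pandas as pd

df = pd.read_csv("iBeacon_RSSI_Labeled.csv")  # assumed file name
# Assumption: beacon columns start with "b"; adjust to the actual header.
rssi_cols = [c for c in df.columns if c.lower().startswith("b")]

# -200 marks an out-of-range iBeacon; treat those readings as missing.
df[rssi_cols] = df[rssi_cols].where(df[rssi_cols] > -200)
print(df.head())
```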
Provider: Mehdi Mohammadi and Ala Al-Fuqaha, {mehdi.mohammadi, ala-alfuqaha}@wmich.edu, Department of Computer Science, Western Michigan University
Citation Request:
M. Mohammadi, A. Al-Fuqaha, M. Guizani, J. Oh, “Semi-supervised Deep Reinforcement Learning in Support of IoT and Smart City Services,” IEEE Internet of Things Journal, Vol. PP, No. 99, 2017.
--- Original source retains full ownership of the source dataset ---
The dataset used in the paper is a wide domain image dataset, and the authors propose a weakly semi-supervised method for disentangling using both labeled and unlabeled data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for SemiEvol
The SemiEvol dataset is part of the broader work on semi-supervised fine-tuning for Large Language Models (LLMs). The dataset includes labeled and unlabeled data splits designed to enhance the reasoning capabilities of LLMs through a bi-level knowledge propagation and selection framework, as proposed in the paper SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation.
Dataset Details
Dataset Sources [optional]… See the full description on the dataset page: https://huggingface.co/datasets/luojunyu/SemiEvol.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate these limitations, we present AMOS, a large-scale, diverse clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and a test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. The paper can be found at https://arxiv.org/pdf/2206.08023.pdf
In addition to providing the 600 labeled CT and MRI scans, we expect to provide 2,000 CT and 1,200 MRI scans without labels to support more learning tasks (semi-supervised, unsupervised, domain adaptation, ...). The link can be found in:
If you find this dataset useful for your research, please cite:
@article{ji2022amos,
title={AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation},
author={Ji, Yuanfeng and Bai, Haotian and Yang, Jie and Ge, Chongjian and Zhu, Ye and Zhang, Ruimao and Li, Zhen and Zhang, Lingyan and Ma, Wanling and Wan, Xiang and others},
journal={arXiv preprint arXiv:2206.08023},
year={2022}
}
Custom license: https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/7DOXQY
Social scientists often classify text documents to use the resulting labels as an outcome or a predictor in empirical research. Automated text classification has become a standard tool, since it requires less human coding. However, scholars still need many human-labeled documents for training. To reduce labeling costs, we propose a new algorithm for text classification that combines a probabilistic model with active learning. The probabilistic model uses both labeled and unlabeled data, and active learning concentrates labeling efforts on difficult documents to classify. Our validation study shows that with few labeled data the classification performance of our algorithm is comparable to state-of-the-art methods at a fraction of the computational cost. We replicate the results of two published articles with only a small fraction of the original labeled data used in those studies, and provide open-source software to implement our method.
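The active-learning loop can be illustrated with a minimal uncertainty-sampling sketch; this is a generic illustration only, not the authors' probabilistic model or released software, and all texts and labels below are made up.

```python
# Minimal uncertainty-sampling sketch: train on a small labeled seed set,
# then query the unlabeled document the model is least certain about.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["budget bill passes", "court ruling issued",
         "team wins final", "striker scores twice"]
labels = np.array([0, 0, 1, 1])  # tiny seed set of human labels
pool = ["new tax legislation", "match ends in a draw"]  # unlabeled pool

vec = TfidfVectorizer().fit(texts + pool)
clf = LogisticRegression().fit(vec.transform(texts), labels)

proba = clf.predict_proba(vec.transform(pool))
uncertainty = 1 - proba.max(axis=1)  # low max probability = hard document
print("label next:", pool[int(np.argmax(uncertainty))])
```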
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of images used for the training and testing of the models with different labeling strategies.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and metadata used in "Machine learning reveals the waggle drift’s role in the honey bee dance communication system"
All timestamps are given in ISO 8601 format.
The following files are included:
Berlin2019_waggle_phases.csv, Berlin2021_waggle_phases.csv
Automatic individual detections of waggle phases during our recording periods in 2019 and 2021.
timestamp: Date and time of the detection.
cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).
x_median, y_median: Median position of the bee during the waggle phase (for 2019 given in millimeters after applying a homography, for 2021 in the original image coordinates).
waggle_angle: Body orientation of the bee during the waggle phase in radians (0: oriented to the right, PI / 2: oriented upwards).
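A small sketch for working with these columns, assuming the CSV layout described above; it parses the ISO 8601 timestamps and converts waggle_angle into a unit direction vector.

```python
# Minimal sketch: load the waggle-phase detections and decompose the body
# orientation into a unit direction vector.
import numpy as np
import pandas as pd

df = pd.read_csv("Berlin2019_waggle_phases.csv", parse_dates=["timestamp"])
df["dir_x"] = np.cos(df["waggle_angle"])  # 0 rad points to the right
df["dir_y"] = np.sin(df["waggle_angle"])
print(df[["timestamp", "dir_x", "dir_y"]].head())
```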
Berlin2019_dances.csv
Automatic detections of dance behavior during our recording period in 2019.
dancer_id: Unique ID of the individual bee.
dance_id: Unique ID of the dance.
ts_from, ts_to: Date and time of the beginning and end of the dance.
cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).
median_x, median_y: Median position of the individual during the dance.
feeder_cam_id: ID of the feeder that the bee was detected at prior to the dance.
Berlin2019_followers.csv
Automatic detections of attendance and following behavior, corresponding to the dances in Berlin2019_dances.csv.
dance_id: Unique ID of the dance being attended or followed.
follower_id: Unique ID of the individual attending or following the dance.
ts_from, ts_to: Date and time of the beginning and end of the interaction.
label: One of “attendance” or “follower”.
cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).
Berlin2019_dances_with_manually_verified_times.csv
A sample of dances from Berlin2019_dances.csv where the exact timestamps have been manually verified to correspond to the beginning of the first and last waggle phase down to a precision of ca. 166 ms (video material was recorded at 6 FPS).
dance_id: Unique ID of the dance.
dancer_id: Unique ID of the dancing individual.
cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).
feeder_cam_id: ID of the feeder that the bee was detected at prior to the dance.
dance_start, dance_end: Manually verified date and times of the beginning and end of the dance.
Berlin2019_dance_classifier_labels.csv
Manually annotated waggle phases or following behavior for our recording season in 2019 that was used to train the dancing and following classifier. Can be merged with the supplied individual detections.
timestamp: Timestamp of the individual frame the behavior was observed in.
frame_id: Unique ID of the video frame the behavior was observed in.
bee_id: Unique ID of the individual bee.
label: One of “nothing”, “waggle”, “follower”
Berlin2019_dance_classifier_unlabeled.csv
Additional unlabeled samples of timestamp and individual ID with the same format as Berlin2019_dance_classifier_labels.csv, but without a label. The data points have been sampled close to detections of our waggle phase classifier, so behaviors related to the waggle dance are likely overrepresented in that sample.
Berlin2021_waggle_phase_classifier_labels.csv
Manually annotated detections of our waggle phase detector (bb_wdd2) that were used to train the neural network filter (bb_wdd_filter) for the 2021 data.
detection_id: Unique ID of the waggle phase.
label: One of “waggle”, “activating”, “ventilating”, “trembling”, “other”. Here, “waggle” denotes a waggle phase, “activating” the shaking signal, and “ventilating” a bee fanning her wings. “trembling” denotes a tremble dance, but since the distinction from the “other” class was often unclear, “trembling” was merged into “other” for training.
orientation: The body orientation of the bee that triggered the detection in radians (0: facing to the right, PI / 2: facing up).
metadata_path: Path to the individual detection in the same directory structure as created by the waggle dance detector.
Berlin2021_waggle_phase_classifier_ground_truth.zip
The output of the waggle dance detector (bb_wdd2) that corresponds to Berlin2021_waggle_phase_classifier_labels.csv and is used for training. The archive includes a directory structure as output by the bb_wdd2 and each directory includes the original image sequence that triggered the detection in an archive and the corresponding metadata. The training code supplied in bb_wdd_filter directly works with this directory structure.
Berlin2019_tracks.zip
Detections and tracks from the recording season in 2019 as produced by our tracking system. As the full data is several terabytes in size, we include here the subset of our data that is relevant for our publication, comprising over 46 million detections. We included tracks for all detected behaviors (dancing, following, attending), including one minute before and after the behavior. We also included all tracks corresponding to the labeled and unlabeled data used to train the dance classifier, including 30 seconds before and after the data used for training.
We grouped the exported data by date to make the handling easier, but to work with the data efficiently, we recommend importing it into an indexable database (a minimal sketch follows the column list below).
The individual files contain the following columns:
cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).
timestamp: Date and time of the detection.
frame_id: Unique ID of the video frame of the recording from which the detection was extracted.
track_id: Unique ID of an individual track (short motion path from one individual). For longer tracks, the detections can be linked based on the bee_id.
bee_id: Unique ID of the individual bee.
bee_id_confidence: Confidence between 0 and 1 that the bee_id is correct as output by our tracking system.
x_pos_hive, y_pos_hive: Spatial position of the bee in the hive on the side indicated by cam_id. Given in millimeters after applying a homography on the video material.
orientation_hive: Orientation of the bee's thorax in the hive in radians (0: oriented to the right, PI / 2: oriented upwards).
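The database import mentioned above could look like the following minimal sketch; the directory layout, table name, and index are assumptions.

```python
# Minimal sketch: bulk-load the per-day track CSVs into an indexable SQLite
# database, then add an index on bee_id for fast per-individual queries.
import glob
import sqlite3
import pandas as pd

con = sqlite3.connect("berlin2019_tracks.db")
for path in glob.glob("Berlin2019_tracks/*.csv"):  # hypothetical layout
    pd.read_csv(path).to_sql("tracks", con, if_exists="append", index=False)

con.execute("CREATE INDEX IF NOT EXISTS idx_tracks_bee ON tracks (bee_id)")
con.commit()
```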
Berlin2019_feeder_experiment_log.csv
Experiment log for our feeder experiments in 2019.
date: Date given in the format year-month-day.
feeder_cam_id: Numeric ID of the feeder.
coordinates: Longitude and latitude of the feeder. For feeders 1 and 2 this is only given once and held constant. Feeder 3 had varying locations.
time_opened, time_closed: Date and time when the feeder was set up or closed again.
sucrose_solution: Concentration of the sucrose solution given as sugar:water (in terms of weight). On days where feeder 3 was open, the other two feeders offered water without sugar.
Software used to acquire and analyze the data:
bb_pipeline_models: Pretrained localizer and decoder models for bb_pipeline
bb_behavior: Database interaction and data (pre)processing, feature extraction
bb_wdd2: Automatic detection and decoding of honey bee waggle dances
bb_wdd_filter: Machine learning model to improve the accuracy of the waggle dance detector
bb_dance_networks: Detection of dancing and following behavior from trajectories
Inspired by the CIFAR-10 dataset, STL-10 is an image recognition dataset for developing unsupervised feature learning and deep learning algorithms. Each class has fewer labeled training examples than CIFAR-10, and a large set of unlabeled samples is provided for learning image models prior to supervised training. The primary challenge is to make use of the unlabeled data. With its higher resolution (96x96), the dataset is expected to be a more challenging benchmark for developing scalable unsupervised learning methods.
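STL-10 ships with torchvision, so the unlabeled split can be loaded directly; a minimal sketch follows (the root path is a placeholder).

```python
# Minimal sketch: load the unlabeled STL-10 split via torchvision.
from torchvision import datasets, transforms

unlabeled = datasets.STL10(
    root="./data", split="unlabeled", download=True,
    transform=transforms.ToTensor(),
)
img, target = unlabeled[0]  # target is -1 for unlabeled images
print(img.shape)            # torch.Size([3, 96, 96])
```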
Privacy policy: https://dataintelo.com/privacy-and-policy
As of 2023, the global self-supervised learning market size is valued at approximately USD 1.5 billion and is expected to escalate to around USD 10.8 billion by 2032, reflecting a compound annual growth rate (CAGR) of 24.1% during the forecast period. This robust growth is driven by the increasing demand for advanced AI models that can learn from large volumes of unlabeled data, significantly reducing the dependency on labeled datasets, thereby making AI training more cost-effective and scalable.
The growth of the self-supervised learning market is fueled by several factors, one of which is the exponential increase in data generation. With the proliferation of digital devices, IoT technologies, and social media platforms, there is an unprecedented amount of data being created every second. Self-supervised learning models leverage this vast amount of unlabeled data to train themselves, making them particularly valuable in industries where data labeling is time-consuming and expensive. This capability is especially pertinent in fields like healthcare, finance, and retail, where the rapid analysis of extensive datasets can lead to significant advancements in predictive analytics and customer insights.
Another critical driver is the advancement in computational technologies that support more sophisticated machine learning models. The development of more powerful GPUs and cloud-based AI platforms has enabled the efficient training and deployment of self-supervised learning models. These technological advancements not only reduce the time required for training but also enhance the accuracy and performance of the models. Furthermore, the integration of self-supervised learning with other AI paradigms such as reinforcement learning and deep learning is opening new avenues for research and application, further propelling market growth.
The increasing adoption of AI across various industries is also a significant growth factor. Businesses are increasingly recognizing the potential of AI to optimize operations, enhance customer experiences, and drive innovation. Self-supervised learning, with its ability to make sense of large, unstructured datasets, is becoming a cornerstone of AI strategies across sectors. For instance, in the healthcare sector, self-supervised learning is being used to develop predictive models for disease diagnosis and treatment planning, while in the finance sector, it aids in fraud detection and risk management.
Regionally, North America is expected to dominate the self-supervised learning market, owing to the presence of leading technology companies and extensive R&D activities in AI. However, the Asia Pacific region is anticipated to witness the fastest growth during the forecast period, driven by rapid digital transformation, increasing investment in AI technologies, and supportive government initiatives. Europe also presents a significant market opportunity, with a strong focus on AI research and development, particularly in countries like Germany, the UK, and France.
The self-supervised learning market is segmented by component into software, hardware, and services. The software segment is expected to hold the largest market share, driven by the development and adoption of advanced AI algorithms and platforms. These software solutions are designed to leverage the vast amounts of unlabeled data available, making them highly valuable for various applications such as natural language processing, computer vision, and predictive analytics. Furthermore, continuous advancements in software capabilities, such as improved model training techniques and enhanced data preprocessing tools, are expected to fuel the growth of this segment.
The hardware segment, while smaller in comparison to software, is crucial for the efficient deployment of self-supervised learning models. This includes high-performance computing systems, GPUs, and specialized AI accelerators that provide the necessary computational power to train and run complex AI models. Innovations in hardware technology, such as the development of more energy-efficient and powerful processing units, are expected to drive growth in this segment. Additionally, the increasing adoption of edge computing devices that can perform AI tasks locally, thereby reducing latency and bandwidth usage, is also contributing to the expansion of the hardware segment.
Services are another vital component of the self-supervised learning market. This segment encompasses various professional services such as consulting, int
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
IITKGP_Fence dataset
Overview
The IITKGP_Fence dataset is designed for tasks related to fence-like occlusion detection, defocus blur, depth mapping, and object segmentation. The captured data varies in scene composition, background defocus, and object occlusions. The dataset comprises both labeled and unlabeled data, as well as additional video and RGB-D data, and contains ground-truth occlusion masks (GT) for the corresponding images. We created the ground truth… See the full description on the dataset page: https://huggingface.co/datasets/NeuroVizv0yaZ3R/IITKGP_Fence_dataset.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This public dataset contains labels for the 100,000 unlabeled pictures in the STL-10 dataset.
The dataset is human-labeled with AI aid through Etiqueta, the one and only gamified mobile data labeling application.
stl10.py
is a Python script written by Martin Tutek to download the complete STL10 dataset.
labels.json
contains labels for the 100,000 previously unlabeled images in the STL10 dataset.
legend.json
is a mapping of the labels used.
stats.ipynb
presents a few statistics regarding the 100,000 newly labeled images.
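A minimal sketch of pairing the new labels with class names; the internal structure of labels.json and legend.json is an assumption inferred from the file descriptions above.

```python
# Minimal sketch: look up the class name for one newly labeled image.
import json

with open("labels.json") as f:
    labels = json.load(f)  # assumed structure: image index -> label id
with open("legend.json") as f:
    legend = json.load(f)  # assumed structure: label id -> class name

first_key = next(iter(labels))
print(first_key, legend[str(labels[first_key])])  # e.g. "0" -> "airplane"
```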
If you use this dataset in your research please cite the following:
@techreport{yagli2025etiqueta,
author = {Semih Yagli},
title = {Etiqueta: AI-Aided, Gamified Data Labeling to Label and Segment Data},
year = {2025},
number = {TR-2025-0001},
address = {NJ, USA},
month = apr,
url = {https://www.aidatalabel.com/technical_reports/aidatalabel_tr_2025_0001.pdf},
institution = {AI Data Label},
}
@inproceedings{coates2011analysis,
title = {An analysis of single-layer networks in unsupervised feature learning},
author = {Coates, Adam and Ng, Andrew and Lee, Honglak},
booktitle = {Proceedings of the fourteenth international conference on artificial intelligence and statistics},
pages = {215--223},
year = {2011},
organization = {JMLR Workshop and Conference Proceedings}
}
Note: The dataset is imported to Kaggle from: https://github.com/semihyagli/STL10-Labeled See also: https://github.com/semihyagli/STL10_Segmentation
If you have comments or questions about Etiqueta or about this dataset, please reach out to us at contact@aidatalabel.com