This dataset consists of unlabeled data representing various data points collected from different sources and domains. The dataset serves as a blank canvas for unsupervised learning experiments, allowing for the exploration of patterns, clusters, and hidden insights through various data analysis techniques. Researchers and data enthusiasts can use this dataset to develop and test unsupervised learning algorithms, identify underlying structures, and gain a deeper understanding of data without predefined labels.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
## Overview
Unlabeled is a dataset for object detection tasks - it contains Face annotations for 2,928 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
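For a programmatic download, a minimal sketch using the `roboflow` Python package is below; the API key and the workspace/project slugs are placeholders, not this dataset's actual identifiers.

```python
# Minimal sketch: download a Roboflow dataset export with the `roboflow` package.
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")  # placeholder API key
project = rf.workspace("your-workspace").project("your-project")  # placeholder slugs
dataset = project.version(1).download("coco")  # choose an export format, e.g. COCO
print(dataset.location)  # local path of the downloaded dataset
```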
## License
This dataset is available under the [Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/).
These datasets were used while writing the following work:
Polo, F. M., Ciochetti, I., and Bertolo, E. (2021). Predicting legal proceedings status: approaches based on sequential text data. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, pages 264–265.
Please cite us if you use our datasets in your academic work:
@inproceedings{polo2021predicting,
title={Predicting legal proceedings status: approaches based on sequential text data},
author={Polo, Felipe Maia and Ciochetti, Itamar and Bertolo, Emerson},
booktitle={Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law},
pages={264--265},
year={2021}
}
More details below!
Every legal proceeding in Brazil is in one of three possible status classes: (i) archived, (ii) active, or (iii) suspended. The status holds at a specific instant in time and may be temporary or permanent. Moreover, statuses are decided by the courts to organize their workflow, which in Brazil may reach thousands of simultaneous cases per judge. Developing machine learning models to classify legal proceedings according to their status can assist public and private institutions in managing large portfolios of legal proceedings, providing gains in scale and efficiency.
In this dataset, each proceeding is made up of a sequence of short texts called “motions” written in Portuguese by the courts’ administrative staff. The motions relate to the proceedings, but not necessarily to their legal status.
Our data is composed of two datasets: a dataset of ~3 million unlabeled motions and a dataset of 6,449 legal proceedings, each with its own variable number of motions, that have been labeled by lawyers. Among the labeled data, 47.14% is classified as archived (class 1), 45.23% as active (class 2), and 7.63% as suspended (class 3).
The datasets we use are representative samples from the first (São Paulo) and third (Rio de Janeiro) most significant state courts. State courts handle the most variable types of cases throughout Brazil and are responsible for 80% of the total amount of lawsuits. Therefore, these datasets are a good representation of a very significant portion of the use of language and expressions in Brazilian legal vocabulary.
Regarding the labels dataset, the key "-1" denotes the most recent text, "-2" the second most recent, and so on.
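To illustrate the key convention, here is a minimal sketch of reconstructing the chronological order of motions; the mapping structure and motion texts are hypothetical stand-ins, not the actual file format.

```python
# Hypothetical example: keys "-1", "-2", ... index recency ("-1" = most recent).
proceeding = {"-1": "latest motion", "-2": "previous motion", "-3": "oldest motion"}

# Sorting the keys numerically puts the oldest motion first ("-3" < "-2" < "-1"),
# yielding the motions in chronological order.
chronological = [proceeding[k] for k in sorted(proceeding, key=int)]
print(chronological)  # ['oldest motion', 'previous motion', 'latest motion']
```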
We would like to thank Ana Carolina Domingues Borges, Andrews Adriani Angeli, and Nathália Caroline Juarez Delgado from Tikal Tech for helping us to obtain the datasets. This work would not be possible without their efforts.
Can you develop good machine learning classifiers for text sequences? :)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data related to the experiment conducted in the paper "Towards the Systematic Testing of Virtual Reality Programs".
It contains an implementation of an approach for predicting defect proneness on unlabeled datasets: Average Clustering and Labeling (ACL).
ACL models achieve good prediction performance, comparable to typical supervised learning models in terms of F-measure, and offer a viable choice for defect prediction on unlabeled datasets.
This dataset also contains analyses related to code smells in C# repositories. Please check the paper for further information.
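As a rough illustration of the ACL idea (cluster unlabeled instances, then label whole clusters by comparing their average metric values to the global averages), here is a minimal sketch; the cluster count, synthetic metrics, and majority rule are assumptions, not the paper's exact procedure.

```python
# Minimal ACL-style sketch: cluster unlabeled modules, then flag clusters
# whose mean metric values mostly exceed the global means as defect-prone.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 10)  # stand-in for per-module static code metrics
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

global_mean = X.mean(axis=0)
labels = np.zeros(len(X), dtype=int)
for c in range(kmeans.n_clusters):
    members = kmeans.labels_ == c
    # Heuristic stand-in for the labeling rule: defect-prone if most of the
    # cluster's mean metric values exceed the global means.
    if (X[members].mean(axis=0) > global_mean).sum() > X.shape[1] / 2:
        labels[members] = 1
print(labels.sum(), "instances flagged as defect-prone")
```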
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Average Dice coefficients of the few-shot supervised learning models using 2%, 5%, and 10% of the labeled data, and of the semi-supervised learning models using 10% of the labeled data for training.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning techniques that rely on textual features or sentiment lexicons can produce erroneous sentiment analysis. These techniques are particularly vulnerable to domain-related difficulties, especially when dealing with big data. In addition, labeling is time-consuming, and supervised machine learning algorithms often lack labeled data. Transfer learning can help save time and achieve high performance with smaller datasets in this field. To address this, we used a transfer learning-based Multi-Domain Sentiment Classification (MDSC) technique. We identify the sentiment polarity of text in an unlabeled target domain by learning from reviews in a labeled source domain. This research aims to evaluate the impact of domain adaptation and measure the extent to which transfer learning enhances sentiment analysis outcomes. We employed the transfer learning models BERT, RoBERTa, ELECTRA, and ULMFiT to improve performance in sentiment analysis. We analyzed sentiment with various transformer models and compared the performance of LSTM and CNN. The experiments were carried out on five publicly available sentiment analysis datasets, namely Hotel Reviews (HR), Movie Reviews (MR), Sentiment140 Tweets (ST), Citation Sentiment Corpus (CSC), and Bioinformatics Citation Corpus (BCC), to adapt multiple target domains. The performance of the models, employing transfer learning across diverse datasets, demonstrates how various factors influence the outputs.
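As an illustration of the transfer setup described above (fine-tune on a labeled source domain, predict on an unlabeled target domain), here is a minimal sketch with Hugging Face Transformers; the model choice and example texts are assumptions, not the paper's exact configuration.

```python
# Minimal cross-domain sentiment sketch with a pretrained transformer.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
# (Fine-tuning on the labeled source domain, e.g. with transformers.Trainer,
# is omitted for brevity.)

target_texts = ["the cited method performs poorly"]  # unlabeled target domain
inputs = tokenizer(target_texts, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # sentiment class probabilities
```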
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We study a class of general M-estimators in the semi-supervised setting, wherein the data are typically a combination of a relatively small labeled dataset and large amounts of unlabeled data. A new estimator, which efficiently uses the useful information contained in the unlabeled data, is proposed via a projection technique. We prove consistency and asymptotic normality, and provide an inference procedure based on K-fold cross-validation. The optimal weights are derived to balance the contributions of the labeled and unlabeled data. It is shown that the proposed method, by taking advantage of the unlabeled data, produces asymptotically more efficient estimation of the target parameters than the supervised counterpart. Supportive numerical evidence is shown in simulation studies. Applications are illustrated in analysis of the homeless data in Los Angeles. Supplementary materials for this article are available online.
GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0-standalone.html
Contains datasets for training and testing models for Urban Seismic Event Detection (USED).
The data is in SAC format, with JSON labels. The obspy library in Python can be used to read this data.
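For example, a minimal loading sketch with obspy; the file names below are placeholders, and the label schema is assumed to be one JSON file per waveform.

```python
# Minimal sketch: read one SAC waveform and its (assumed) JSON label file.
import json
from obspy import read

stream = read("event_0001.SAC")  # hypothetical file name
trace = stream[0]
print(trace.stats.sampling_rate, trace.data.shape)

with open("event_0001.json") as f:  # hypothetical label file name
    label = json.load(f)
print(label)
```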
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘BLE RSSI Dataset for Indoor localization’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mehdimka/ble-rssi-dataset on 20 November 2021.
--- Dataset description provided by original source is as follows ---
The dataset was created using the RSSI readings of an array of 13 iBeacons on the first floor of Waldo Library, Western Michigan University. Data was collected using an iPhone 6S. The dataset contains two sub-datasets: a labeled dataset (1,420 instances) and an unlabeled dataset (5,191 instances). The recording was performed during the operational hours of the library. For the labeled dataset, the input data contains the location (label column) and a timestamp, followed by the RSSI readings of the 13 iBeacons. RSSI measurements are negative values; larger RSSI values indicate closer proximity to a given iBeacon (e.g., an RSSI of -65 represents a closer distance than an RSSI of -85). For out-of-range iBeacons, the RSSI is indicated by -200. The locations related to the RSSI readings are combined in one column consisting of a letter for the column and a number for the row of the position. The following figure depicts the layout of the iBeacons as well as the arrangement of locations.
iBeacons layout: https://www.kaggle.com/mehdimka/ble-rssi-dataset/downloads/iBeacon_Layout.jpg
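A minimal loading sketch is below; the CSV file name and the beacon column naming are assumptions based on the description above.

```python
# Minimal sketch: load the labeled sub-dataset and mask out-of-range beacons.
import pandas as pd

df = pd.read_csv("iBeacon_RSSI_Labeled.csv")  # assumed file name
# Assumption: beacon columns start with "b"; adjust to the actual header.
rssi_cols = [c for c in df.columns if c.lower().startswith("b")]

# -200 marks an out-of-range iBeacon; treat those readings as missing.
df[rssi_cols] = df[rssi_cols].where(df[rssi_cols] > -200)
print(df.head())
```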
Provider: Mehdi Mohammadi and Ala Al-Fuqaha, {mehdi.mohammadi, ala-alfuqaha}@wmich.edu, Department of Computer Science, Western Michigan University
Citation Request:
M. Mohammadi, A. Al-Fuqaha, M. Guizani, J. Oh, “Semi-supervised Deep Reinforcement Learning in Support of IoT and Smart City Services,” IEEE Internet of Things Journal, Vol. PP, No. 99, 2017.
--- Original source retains full ownership of the source dataset ---
The dataset used in the paper is a wide domain image dataset, and the authors propose a weakly semi-supervised method for disentangling using both labeled and unlabeled data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for SemiEvol
The SemiEvol dataset is part of the broader work on semi-supervised fine-tuning for Large Language Models (LLMs). The dataset includes labeled and unlabeled data splits designed to enhance the reasoning capabilities of LLMs through a bi-level knowledge propagation and selection framework, as proposed in the paper SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation.
Dataset Details
Dataset Sources [optional]… See the full description on the dataset page: https://huggingface.co/datasets/luojunyu/SemiEvol.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate these limitations, we present AMOS, a large-scale, diverse clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and a test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. The paper can be found at https://arxiv.org/pdf/2206.08023.pdf
In addition to providing the 600 labeled CT and MRI scans, we expect to provide 2,000 CT and 1,200 MRI scans without labels to support more learning tasks (semi-supervised, unsupervised, domain adaptation, ...). The link can be found in:
If you find this dataset useful for your research, please cite:
@article{ji2022amos,
title={AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation},
author={Ji, Yuanfeng and Bai, Haotian and Yang, Jie and Ge, Chongjian and Zhu, Ye and Zhang, Ruimao and Li, Zhen and Zhang, Lingyan and Ma, Wanling and Wan, Xiang and others},
journal={arXiv preprint arXiv:2206.08023},
year={2022}
}
Custom license: https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/7DOXQY
Social scientists often classify text documents to use the resulting labels as an outcome or a predictor in empirical research. Automated text classification has become a standard tool, since it requires less human coding. However, scholars still need many human-labeled documents for training. To reduce labeling costs, we propose a new algorithm for text classification that combines a probabilistic model with active learning. The probabilistic model uses both labeled and unlabeled data, and active learning concentrates labeling efforts on difficult documents to classify. Our validation study shows that with few labeled data the classification performance of our algorithm is comparable to state-of-the-art methods at a fraction of the computational cost. We replicate the results of two published articles with only a small fraction of the original labeled data used in those studies, and provide open-source software to implement our method.
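The active-learning loop can be illustrated with a minimal uncertainty-sampling sketch; this is a generic illustration only, not the authors' probabilistic model or released software, and all texts and labels below are made up.

```python
# Minimal uncertainty-sampling sketch: train on a small labeled seed set,
# then query the unlabeled document the model is least certain about.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["budget bill passes", "court ruling issued",
         "team wins final", "striker scores twice"]
labels = np.array([0, 0, 1, 1])  # tiny seed set of human labels
pool = ["new tax legislation", "match ends in a draw"]  # unlabeled pool

vec = TfidfVectorizer().fit(texts + pool)
clf = LogisticRegression().fit(vec.transform(texts), labels)

proba = clf.predict_proba(vec.transform(pool))
uncertainty = 1 - proba.max(axis=1)  # low max probability = hard document
print("label next:", pool[int(np.argmax(uncertainty))])
```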
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of images used for the training and testing of the models with different labeling strategies.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and metadata used in "Machine learning reveals the waggle drift’s role in the honey bee dance communication system"
All timestamps are given in ISO 8601 format.
The following files are included:
Berlin2019_waggle_phases.csv, Berlin2021_waggle_phases.csv
Automatic individual detections of waggle phases during our recording periods in 2019 and 2021.
timestamp: Date and time of the detection.
cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).
x_median, y_median: Median position of the bee during the waggle phase (for 2019 given in millimeters after applying a homography, for 2021 in the original image coordinates).
waggle_angle: Body orientation of the bee during the waggle phase in radians (0: oriented to the right, PI / 2: oriented upwards).
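A small sketch for working with these columns, assuming the CSV layout described above; it parses the ISO 8601 timestamps and converts waggle_angle into a unit direction vector.

```python
# Minimal sketch: load the waggle-phase detections and decompose the body
# orientation into a unit direction vector.
import numpy as np
import pandas as pd

df = pd.read_csv("Berlin2019_waggle_phases.csv", parse_dates=["timestamp"])
df["dir_x"] = np.cos(df["waggle_angle"])  # 0 rad points to the right
df["dir_y"] = np.sin(df["waggle_angle"])
print(df[["timestamp", "dir_x", "dir_y"]].head())
```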
Berlin2019_dances.csv
Automatic detections of dance behavior during our recording period in 2019.
dancer_id: Unique ID of the individual bee.
dance_id: Unique ID of the dance.
ts_from, ts_to: Date and time of the beginning and end of the dance.
cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).
median_x, median_y: Median position of the individual during the dance.
feeder_cam_id: ID of the feeder that the bee was detected at prior to the dance.
Berlin2019_followers.csv
Automatic detections of attendance and following behavior, corresponding to the dances in Berlin2019_dances.csv.
dance_id: Unique ID of the dance being attended or followed.
follower_id: Unique ID of the individual attending or following the dance.
ts_from, ts_to: Date and time of the beginning and end of the interaction.
label: One of “attendance” or “follower”.
cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).
Berlin2019_dances_with_manually_verified_times.csv
A sample of dances from Berlin2019_dances.csv where the exact timestamps have been manually verified to correspond to the beginning of the first and last waggle phase down to a precision of ca. 166 ms (video material was recorded at 6 FPS).
dance_id: Unique ID of the dance.
dancer_id: Unique ID of the dancing individual.
cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).
feeder_cam_id: ID of the feeder that the bee was detected at prior to the dance.
dance_start, dance_end: Manually verified date and times of the beginning and end of the dance.
Berlin2019_dance_classifier_labels.csv
Manually annotated waggle phases or following behavior for our recording season in 2019 that was used to train the dancing and following classifier. Can be merged with the supplied individual detections.
timestamp: Timestamp of the individual frame the behavior was observed in.
frame_id: Unique ID of the video frame the behavior was observed in.
bee_id: Unique ID of the individual bee.
label: One of “nothing”, “waggle”, “follower”
Berlin2019_dance_classifier_unlabeled.csv
Additional unlabeled samples of timestamp and individual ID with the same format as Berlin2019_dance_classifier_labels.csv, but without a label. The data points have been sampled close to detections of our waggle phase classifier, so behaviors related to the waggle dance are likely overrepresented in that sample.
Berlin2021_waggle_phase_classifier_labels.csv
Manually annotated detections of our waggle phase detector (bb_wdd2) that were used to train the neural network filter (bb_wdd_filter) for the 2021 data.
detection_id: Unique ID of the waggle phase.
label: One of “waggle”, “activating”, “ventilating”, “trembling”, “other”. Here, “waggle” denotes a waggle phase, “activating” the shaking signal, and “ventilating” a bee fanning her wings. “trembling” denotes a tremble dance, but since the distinction from the “other” class was often unclear, “trembling” was merged into “other” for training.
orientation: The body orientation of the bee that triggered the detection in radians (0: facing to the right, PI / 2: facing up).
metadata_path: Path to the individual detection in the same directory structure as created by the waggle dance detector.
Berlin2021_waggle_phase_classifier_ground_truth.zip
The output of the waggle dance detector (bb_wdd2) that corresponds to Berlin2021_waggle_phase_classifier_labels.csv and is used for training. The archive includes a directory structure as output by the bb_wdd2 and each directory includes the original image sequence that triggered the detection in an archive and the corresponding metadata. The training code supplied in bb_wdd_filter directly works with this directory structure.
Berlin2019_tracks.zip
Detections and tracks from the recording season in 2019 as produced by our tracking system. As the full data is several terabytes in size, we include here the subset of our data that is relevant for our publication, comprising over 46 million detections. We included tracks for all detected behaviors (dancing, following, attending), including one minute before and after the behavior. We also included all tracks corresponding to the labeled and unlabeled data used to train the dance classifier, including 30 seconds before and after the data used for training.
We grouped the exported data by date to make the handling easier, but to work with the data efficiently, we recommend importing it into an indexable database (a minimal sketch follows the column list below).
The individual files contain the following columns:
cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).
timestamp: Date and time of the detection.
frame_id: Unique ID of the video frame of the recording from which the detection was extracted.
track_id: Unique ID of an individual track (short motion path from one individual). For longer tracks, the detections can be linked based on the bee_id.
bee_id: Unique ID of the individual bee.
bee_id_confidence: Confidence between 0 and 1 that the bee_id is correct as output by our tracking system.
x_pos_hive, y_pos_hive: Spatial position of the bee in the hive on the side indicated by cam_id. Given in millimeters after applying a homography on the video material.
orientation_hive: Orientation of the bee's thorax in the hive in radians (0: oriented to the right, PI / 2: oriented upwards).
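The database import mentioned above could look like the following minimal sketch; the directory layout, table name, and index are assumptions.

```python
# Minimal sketch: bulk-load the per-day track CSVs into an indexable SQLite
# database, then add an index on bee_id for fast per-individual queries.
import glob
import sqlite3
import pandas as pd

con = sqlite3.connect("berlin2019_tracks.db")
for path in glob.glob("Berlin2019_tracks/*.csv"):  # hypothetical layout
    pd.read_csv(path).to_sql("tracks", con, if_exists="append", index=False)

con.execute("CREATE INDEX IF NOT EXISTS idx_tracks_bee ON tracks (bee_id)")
con.commit()
```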
Berlin2019_feeder_experiment_log.csv
Experiment log for our feeder experiments in 2019.
date: Date given in the format year-month-day.
feeder_cam_id: Numeric ID of the feeder.
coordinates: Longitude and latitude of the feeder. For feeders 1 and 2 this is only given once and held constant. Feeder 3 had varying locations.
time_opened, time_closed: Date and time when the feeder was set up or closed again.
sucrose_solution: Concentration of the sucrose solution given as sugar:water (in terms of weight). On days where feeder 3 was open, the other two feeders offered water without sugar.
Software used to acquire and analyze the data:
bb_pipeline_models: Pretrained localizer and decoder models for bb_pipeline
bb_behavior: Database interaction and data (pre)processing, feature extraction
bb_wdd2: Automatic detection and decoding of honey bee waggle dances
bb_wdd_filter: Machine learning model to improve the accuracy of the waggle dance detector
bb_dance_networks: Detection of dancing and following behavior from trajectories
Inspired by the CIFAR-10 dataset, STL-10 is an image recognition dataset for developing unsupervised feature learning and deep learning algorithms. Each class has fewer labeled training examples than CIFAR-10, and a large set of unlabeled samples is provided for learning image models prior to supervised training. The primary challenge is to make use of the unlabeled data. With its higher resolution (96x96), the dataset is expected to be a more challenging benchmark for developing scalable unsupervised learning methods.
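STL-10 ships with torchvision, so the unlabeled split can be loaded directly; a minimal sketch follows (the root path is a placeholder).

```python
# Minimal sketch: load the unlabeled STL-10 split via torchvision.
from torchvision import datasets, transforms

unlabeled = datasets.STL10(
    root="./data", split="unlabeled", download=True,
    transform=transforms.ToTensor(),
)
img, target = unlabeled[0]  # target is -1 for unlabeled images
print(img.shape)            # torch.Size([3, 96, 96])
```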
Privacy policy: https://dataintelo.com/privacy-and-policy
As of 2023, the global self-supervised learning market size is valued at approximately USD 1.5 billion and is expected to escalate to around USD 10.8 billion by 2032, reflecting a compound annual growth rate (CAGR) of 24.1% during the forecast period. This robust growth is driven by the increasing demand for advanced AI models that can learn from large volumes of unlabeled data, significantly reducing the dependency on labeled datasets, thereby making AI training more cost-effective and scalable.
The growth of the self-supervised learning market is fueled by several factors, one of which is the exponential increase in data generation. With the proliferation of digital devices, IoT technologies, and social media platforms, there is an unprecedented amount of data being created every second. Self-supervised learning models leverage this vast amount of unlabeled data to train themselves, making them particularly valuable in industries where data labeling is time-consuming and expensive. This capability is especially pertinent in fields like healthcare, finance, and retail, where the rapid analysis of extensive datasets can lead to significant advancements in predictive analytics and customer insights.
Another critical driver is the advancement in computational technologies that support more sophisticated machine learning models. The development of more powerful GPUs and cloud-based AI platforms has enabled the efficient training and deployment of self-supervised learning models. These technological advancements not only reduce the time required for training but also enhance the accuracy and performance of the models. Furthermore, the integration of self-supervised learning with other AI paradigms such as reinforcement learning and deep learning is opening new avenues for research and application, further propelling market growth.
The increasing adoption of AI across various industries is also a significant growth factor. Businesses are increasingly recognizing the potential of AI to optimize operations, enhance customer experiences, and drive innovation. Self-supervised learning, with its ability to make sense of large, unstructured datasets, is becoming a cornerstone of AI strategies across sectors. For instance, in the healthcare sector, self-supervised learning is being used to develop predictive models for disease diagnosis and treatment planning, while in the finance sector, it aids in fraud detection and risk management.
Regionally, North America is expected to dominate the self-supervised learning market, owing to the presence of leading technology companies and extensive R&D activities in AI. However, the Asia Pacific region is anticipated to witness the fastest growth during the forecast period, driven by rapid digital transformation, increasing investment in AI technologies, and supportive government initiatives. Europe also presents a significant market opportunity, with a strong focus on AI research and development, particularly in countries like Germany, the UK, and France.
The self-supervised learning market is segmented by component into software, hardware, and services. The software segment is expected to hold the largest market share, driven by the development and adoption of advanced AI algorithms and platforms. These software solutions are designed to leverage the vast amounts of unlabeled data available, making them highly valuable for various applications such as natural language processing, computer vision, and predictive analytics. Furthermore, continuous advancements in software capabilities, such as improved model training techniques and enhanced data preprocessing tools, are expected to fuel the growth of this segment.
The hardware segment, while smaller in comparison to software, is crucial for the efficient deployment of self-supervised learning models. This includes high-performance computing systems, GPUs, and specialized AI accelerators that provide the necessary computational power to train and run complex AI models. Innovations in hardware technology, such as the development of more energy-efficient and powerful processing units, are expected to drive growth in this segment. Additionally, the increasing adoption of edge computing devices that can perform AI tasks locally, thereby reducing latency and bandwidth usage, is also contributing to the expansion of the hardware segment.
Services are another vital component of the self-supervised learning market. This segment encompasses various professional services such as consulting, int
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
IITKGP_Fence dataset
Overview
The IITKGP_Fence dataset is designed for tasks related to fence-like occlusion detection, defocus blur, depth mapping, and object segmentation. The captured data varies in scene composition, background defocus, and object occlusions. The dataset comprises both labeled and unlabeled data, as well as additional video and RGB-D data, and contains ground-truth occlusion masks (GT) for the corresponding images. We created the ground truth… See the full description on the dataset page: https://huggingface.co/datasets/NeuroVizv0yaZ3R/IITKGP_Fence_dataset.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This public dataset contains labels for the 100,000 unlabeled pictures in the STL-10 dataset.
The dataset is human-labeled with AI aid through Etiqueta, the one and only gamified mobile data labeling application.
stl10.py
is a Python script written by Martin Tutek to download the complete STL10 dataset.
labels.json
contains labels for the 100,000 previously unlabeled images in the STL10 dataset.
legend.json
is a mapping of the labels used.
stats.ipynb
presents a few statistics regarding the 100,000 newly labeled images.
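A minimal sketch of pairing the new labels with class names; the internal structure of labels.json and legend.json is an assumption inferred from the file descriptions above.

```python
# Minimal sketch: look up the class name for one newly labeled image.
import json

with open("labels.json") as f:
    labels = json.load(f)  # assumed structure: image index -> label id
with open("legend.json") as f:
    legend = json.load(f)  # assumed structure: label id -> class name

first_key = next(iter(labels))
print(first_key, legend[str(labels[first_key])])  # e.g. "0" -> "airplane"
```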
If you use this dataset in your research please cite the following:
@techreport{yagli2025etiqueta,
author = {Semih Yagli},
title = {Etiqueta: AI-Aided, Gamified Data Labeling to Label and Segment Data},
year = {2025},
number = {TR-2025-0001},
address = {NJ, USA},
month = apr,
url = {https://www.aidatalabel.com/technical_reports/aidatalabel_tr_2025_0001.pdf},
institution = {AI Data Label},
}
@inproceedings{coates2011analysis,
title = {An analysis of single-layer networks in unsupervised feature learning},
author = {Coates, Adam and Ng, Andrew and Lee, Honglak},
booktitle = {Proceedings of the fourteenth international conference on artificial intelligence and statistics},
pages = {215--223},
year = {2011},
organization = {JMLR Workshop and Conference Proceedings}
}
Note: The dataset is imported to Kaggle from: https://github.com/semihyagli/STL10-Labeled See also: https://github.com/semihyagli/STL10_Segmentation
If you have comments or questions about Etiqueta or about this dataset, please reach out to us at contact@aidatalabel.com