This dataset consists of unlabeled data representing various data points collected from different sources and domains. The dataset serves as a blank canvas for unsupervised learning experiments, allowing for the exploration of patterns, clusters, and hidden insights through various data analysis techniques. Researchers and data enthusiasts can use this dataset to develop and test unsupervised learning algorithms, identify underlying structures, and gain a deeper understanding of data without predefined labels.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
## Overview
Unlabeled is a dataset for object detection tasks - it contains Face annotations for 2,928 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Objects2022 Unlabeled is a dataset for object detection tasks - it contains Household Objects annotations for 727 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data related to the experiment conducted in the paper "Towards the Systematic Testing of Virtual Reality Programs".
It contains an implementation of an approach for predicting defect proneness on unlabeled datasets: Average Clustering and Labeling (ACL).
ACL models achieve good prediction performance and are comparable to typical supervised learning models in terms of F-measure, making ACL a viable choice for defect prediction on unlabeled datasets.
This dataset also contains analyses related to code smells in C# repositories. Please check the paper for further information.
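As a rough illustration of the clustering-and-labeling idea behind ACL, the sketch below clusters unlabeled modules and flags a cluster as defect-prone when most of its mean metric values exceed the global means. The toy data, cluster count, and majority-vote threshold are all illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch of an Average Clustering and Labeling (ACL) style
# approach: cluster unlabeled software modules, then mark a cluster as
# defect-prone when most of its mean metric values exceed the global means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((200, 5))  # 200 modules, 5 static code metrics (toy data)

k = 2
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

global_mean = X.mean(axis=0)
labels = np.zeros(len(X), dtype=int)
for c in range(k):
    cluster_mean = X[clusters == c].mean(axis=0)
    # Flag the cluster when a majority of its metric means exceed
    # the corresponding global means (an assumed labeling rule).
    if (cluster_mean > global_mean).sum() > X.shape[1] / 2:
        labels[clusters == c] = 1

print(labels.sum(), "modules flagged as defect-prone")
```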
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository lists all the repositories needed to load the unlabeled Sentinel-2 (S2) L2A dataset used in the article "Self-Supervised Spatio-Temporal Representation Learning of Satellite Image Time Series". The dataset is composed of patch time series acquired over France; for further details, see Section IV.A of the pre-print article, available here. Each patch comprises the 10 bands [B2, B3, B4, B5, B6, B7, B8, B8A, B11, B12] and the three masks ['CLM_R1', 'EDG_R1', 'SAT_R1']. The global dataset is split into two disjoint sets: a training dataset (9 tiles) and a validation dataset (4 tiles).
The validation dataset is available here: 10.5281/zenodo.7890452
The training dataset is composed of 9 Zenodo repositories, one per S2 tile. Here are the available repositories:
T31UEP 10.5281/zenodo.7899943
T31TGJ 10.5281/zenodo.7899237
T30TYS 10.5281/zenodo.7924193
T31TFN 10.5281/zenodo.7896621
T31TDL 10.5281/zenodo.7896082
T31TDJ 10.5281/zenodo.7895498
T30UVU 10.5281/zenodo.7892410
T30TYQ 10.5281/zenodo.7890542
T30TXT 10.5281/zenodo.7875977
| Dataset | S2 tiles | ROI size | Temporal extent |
|---|---|---|---|
| Train | T30TXT, T30TYQ, T30TYS, T30UVU, T31TDJ, T31TDL, T31TFN, T31TGJ, T31UEP | 1024×1024 | 2018-2020 |
| Val | T30TYR, T30UWU, T31TEK, T31UER | 256×256 | 2016-2019 |
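Assuming each downloaded patch time series is an array of shape (T, channels, H, W) with the 10 bands followed by the 3 masks in the order listed above (the actual file layout on Zenodo may differ; check each repository's README), a minimal handling sketch could look like:

```python
# Hedged sketch: select reflectance bands, use the CLM_R1 cloud mask, and
# compute per-band temporal means over cloud-free pixels. The channel order
# is an assumption taken from the band/mask lists above.
import numpy as np

BANDS = ["B2", "B3", "B4", "B5", "B6", "B7", "B8", "B8A", "B11", "B12"]
MASKS = ["CLM_R1", "EDG_R1", "SAT_R1"]

# Toy stand-in for a loaded patch: 12 dates, 13 channels, 64x64 pixels.
patch = np.random.rand(12, len(BANDS) + len(MASKS), 64, 64)

reflectance = patch[:, : len(BANDS)]     # (T, 10, H, W)
clouds = patch[:, len(BANDS)] > 0.5      # CLM_R1 as a boolean mask (T, H, W)
valid = ~clouds                          # keep cloud-free pixels

# Per-band temporal mean over valid pixels only.
masked = np.where(valid[:, None], reflectance, np.nan)
band_means = np.nanmean(masked, axis=(0, 2, 3))
print(band_means.shape)  # (10,)
```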
koyealagbe/nnces-unlabeled dataset hosted on Hugging Face and contributed by the HF Datasets community
oliveirabruno01/shaped-svgs-small-unlabeled-900 dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by ArbaazKhan3
Released under Apache 2.0
taylor-joren/peer-unlabeled dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by ifeomaozo12
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Training and execution times (in seconds) of considered classifiers on the original collected dataset.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/1XXDMW
Identifying important policy outputs has long been of interest to political scientists. In this work, we propose a novel approach to the classification of policies. Instead of obtaining and aggregating expert evaluations of significance for a finite set of policy outputs, we use experts to identify a small set of significant outputs and then employ positive unlabeled (PU) learning to search for other similar examples in a large unlabeled set. We further propose to automate the first step by harvesting ‘seed’ sets of significant outputs from web data. We offer an application of the new approach by classifying over 9,000 government regulations in the United Kingdom. The obtained estimates are successfully validated against human experts, by forecasting web citations, and with a construct validity test.
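The two-step flavor of positive-unlabeled (PU) learning described above can be sketched generically (this is a common textbook recipe, not the authors' exact pipeline): train positives against the unlabeled pool, treat the lowest-scoring unlabeled items as reliable negatives, and retrain.

```python
# Generic two-step PU learning sketch on toy data. The feature distributions,
# classifier choice, and 0.3 quantile threshold are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pos = rng.normal(2.0, 1.0, size=(50, 4))    # known significant outputs
X_unl = rng.normal(0.0, 1.0, size=(500, 4))   # large unlabeled pool

# Step 1: train positives vs. unlabeled-as-negative.
X1 = np.vstack([X_pos, X_unl])
y1 = np.r_[np.ones(len(X_pos)), np.zeros(len(X_unl))]
clf1 = LogisticRegression(max_iter=1000).fit(X1, y1)

# Step 2: unlabeled examples with the lowest positive probability
# become "reliable negatives".
scores = clf1.predict_proba(X_unl)[:, 1]
reliable_neg = X_unl[scores < np.quantile(scores, 0.3)]

# Step 3: retrain on positives vs. reliable negatives, then score the pool.
X2 = np.vstack([X_pos, reliable_neg])
y2 = np.r_[np.ones(len(X_pos)), np.zeros(len(reliable_neg))]
clf2 = LogisticRegression(max_iter=1000).fit(X2, y2)
predicted_significant = clf2.predict(X_unl).sum()
print(predicted_significant, "of", len(X_unl), "unlabeled items flagged")
```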
Unlabeled Social Stories Dataset
This dataset contains high-quality social stories generated by different LLMs aimed at supporting children with special needs.
Citation
If you use this dataset, please cite:
@misc{li2025socialstories,
  title = {Unlabeled Dataset},
  author = {Wen Li},
  year = {2025},
  howpublished = {\url{https://huggingface.co/datasets/yirruli/Unlabeled_Dataset}},
  note = {Accessed: [date]}
}
PDAP/unlabeled-urls dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classification performance of considered classifiers on the artificially balanced dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Clustering is an unsupervised machine learning technique whose goal is to group unlabeled data. Traditional clustering methods, however, only output a set of results and do not explain them. Although a number of decision-tree-based methods have been proposed in the literature to explain clustering results, most suffer from drawbacks such as too many branches and overly deep leaves, which lead to complex explanations that are difficult for users to understand. In this paper, a hypercube overlay model based on multi-objective optimization is proposed to achieve succinct explanations of clustering results. The model designs two objective functions based on the number of hypercubes and the compactness of instances, and then uses multi-objective optimization to find a set of nondominated solutions. Finally, a Utopia point is defined to determine the most suitable solution, in which each cluster is covered by as few hypercubes as possible. Based on these hypercubes, an explanation of each cluster is provided. Verification on synthetic and real datasets shows that the model provides concise and understandable explanations to users.
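As a much-simplified illustration of hypercube-style explanations, the sketch below describes each cluster with a single axis-aligned bounding box (one interval per feature), whereas the paper optimizes multiple hypercubes per cluster via multi-objective search.

```python
# Simplified hypercube explanation: one axis-aligned box per cluster,
# expressed as readable interval rules. Data and cluster count are toy choices.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

explanations = {}
for c in np.unique(clusters):
    pts = X[clusters == c]
    # One [min, max] interval per feature covers every instance in the cluster.
    explanations[c] = list(zip(pts.min(axis=0), pts.max(axis=0)))

for c, box in explanations.items():
    rules = " AND ".join(f"{lo:.2f} <= x{i} <= {hi:.2f}"
                         for i, (lo, hi) in enumerate(box))
    print(f"cluster {c}: {rules}")
```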
These datasets were used while writing the following work:
Polo, F. M., Ciochetti, I., and Bertolo, E. (2021). Predicting legal proceedings status: approaches based on sequential text data. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, pages 264–265.
Please cite us if you use our datasets in your academic work:
@inproceedings{polo2021predicting,
title={Predicting legal proceedings status: approaches based on sequential text data},
author={Polo, Felipe Maia and Ciochetti, Itamar and Bertolo, Emerson},
booktitle={Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law},
pages={264--265},
year={2021}
}
More details below!
Every legal proceeding in Brazil falls into one of three status classes: (i) archived, (ii) active, and (iii) suspended. A status is assigned at a specific instant in time and may be temporary or permanent. Moreover, statuses are decided by the courts to organize their workflow, which in Brazil may reach thousands of simultaneous cases per judge. Developing machine learning models to classify legal proceedings by status can assist public and private institutions in managing large portfolios of proceedings, providing gains in scale and efficiency.
In this dataset, each proceeding is made up of a sequence of short texts called “motions” written in Portuguese by the courts’ administrative staff. The motions relate to the proceedings, but not necessarily to their legal status.
Our data is composed of two datasets: a dataset of ~3*10^6 unlabeled motions and a dataset containing 6449 legal proceedings, each with an individual and a variable number of motions, but which have been labeled by lawyers. Among the labeled data, 47.14% is classified as archived (class 1), 45.23% is classified as active (class 2), and 7.63% is classified as suspended (class 3).
The datasets we use are representative samples from the first (São Paulo) and third (Rio de Janeiro) most significant state courts. State courts handle the most variable types of cases throughout Brazil and are responsible for 80% of the total amount of lawsuits. Therefore, these datasets are a good representation of a very significant portion of the use of language and expressions in Brazilian legal vocabulary.
Regarding the labeled dataset, the key "-1" denotes the most recent text, "-2" the second most recent, and so on.
We would like to thank Ana Carolina Domingues Borges, Andrews Adriani Angeli, and Nathália Caroline Juarez Delgado from Tikal Tech for helping us to obtain the datasets. This work would not be possible without their efforts.
Can you develop good machine learning classifiers for text sequences? :)
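As a hypothetical starting point for that challenge, a simple baseline could concatenate each proceeding's most recent motions (following the "-1", "-2", ... key scheme described above) and train a TF-IDF plus logistic-regression classifier. The toy motion texts and the exact record schema below are illustrative assumptions, not the dataset's actual files.

```python
# Hedged baseline sketch: flatten each proceeding's recent motions into one
# document and classify its status (1 archived, 2 active, 3 suspended).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy proceedings: dict of motion texts keyed from most recent ("-1") back.
proceedings = [
    {"-1": "arquivamento definitivo", "-2": "baixa dos autos"},
    {"-1": "audiencia designada", "-2": "citacao da parte re"},
    {"-1": "suspensao do processo", "-2": "acordo entre as partes"},
]
labels = [1, 2, 3]  # archived, active, suspended

def flatten(p, n_recent=5):
    """Join the n most recent motions into one document, "-1" first."""
    keys = sorted(p, key=lambda k: int(k))[::-1]
    return " ".join(p[k] for k in keys[:n_recent])

docs = [flatten(p) for p in proceedings]
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(docs, labels)
print(model.predict([flatten({"-1": "arquivamento dos autos"})]))
```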
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This public dataset contains labels for the unlabeled 100,000 pictures in the STL-10 dataset.
The dataset is human-labeled with AI aid through Etiqueta, the one and only gamified mobile data labeling application.
- `stl10.py` is a Python script written by Martin Tutek to download the complete STL-10 dataset.
- `labels.json` contains labels for the 100,000 previously unlabeled images in the STL-10 dataset.
- `legend.json` is a mapping of the labels used.
- `stats.ipynb` presents a few statistics regarding the 100,000 newly labeled images.
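The JSON schemas are not documented here, so the sketch below assumes `labels.json` maps image indices to integer labels and `legend.json` maps those integers to class names; adjust it to the actual files.

```python
# Hedged sketch of combining labels.json and legend.json. The schema and the
# class names used here are assumptions; tiny stand-in files are written
# locally so the example is self-contained.
import json

with open("legend.json", "w") as f:
    json.dump({"0": "airplane", "1": "bird"}, f)
with open("labels.json", "w") as f:
    json.dump({"0": 1, "1": 0, "2": 1}, f)

with open("legend.json") as f:
    legend = json.load(f)
with open("labels.json") as f:
    labels = json.load(f)

# Resolve each image index to its class name.
named = {int(idx): legend[str(lab)] for idx, lab in labels.items()}
print(named)  # {0: 'bird', 1: 'airplane', 2: 'bird'}
```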
If you use this dataset in your research please cite the following:
@techreport{yagli2025etiqueta,
author = {Semih Yagli},
title = {Etiqueta: AI-Aided, Gamified Data Labeling to Label and Segment Data},
year = {2025},
number = {TR-2025-0001},
address = {NJ, USA},
month = apr,
url = {https://www.aidatalabel.com/technical_reports/aidatalabel_tr_2025_0001.pdf},
institution = {AI Data Label},
}
@inproceedings{coates2011analysis,
title = {An analysis of single-layer networks in unsupervised feature learning},
author = {Coates, Adam and Ng, Andrew and Lee, Honglak},
booktitle = {Proceedings of the fourteenth international conference on artificial intelligence and statistics},
pages = {215--223},
year = {2011},
organization = {JMLR Workshop and Conference Proceedings}
}
Note: The dataset is imported to Kaggle from: https://github.com/semihyagli/STL10-Labeled See also: https://github.com/semihyagli/STL10_Segmentation
If you have comments or questions about Etiqueta or about this dataset, please reach out to us at contact@aidatalabel.com
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by vialactea
Released under MIT
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by kevinzb56
Released under Apache 2.0