100+ datasets found
  1. Unlabelled dataset

    • kaggle.com
    Updated Oct 29, 2023
    Cite
    Data Diggers (2023). Unlabelled dataset [Dataset]. https://www.kaggle.com/datasets/ahmedaliraja/unlabelled-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 29, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Data Diggers
    Description

    This dataset consists of unlabeled data representing various data points collected from different sources and domains. The dataset serves as a blank canvas for unsupervised learning experiments, allowing for the exploration of patterns, clusters, and hidden insights through various data analysis techniques. Researchers and data enthusiasts can use this dataset to develop and test unsupervised learning algorithms, identify underlying structures, and gain a deeper understanding of data without predefined labels.
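    As a concrete example of the kind of unsupervised experiment this description invites, here is a minimal k-means clustering sketch in pure Python (the toy 1-D data is fabricated, not taken from the dataset):

    ```python
    import random

    def kmeans(points, k, iters=20, seed=0):
        """Tiny k-means for 1-D points; illustration only, not a library-grade implementation."""
        rng = random.Random(seed)
        centers = rng.sample(points, k)
        for _ in range(iters):
            # Assign each point to its nearest center.
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda c: abs(p - centers[c]))
                clusters[i].append(p)
            # Move each center to the mean of its cluster (keep it if the cluster is empty).
            centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        return sorted(centers)

    data = [1.0, 1.2, 0.8, 9.8, 10.1, 10.3]
    print(kmeans(data, 2))  # two centers, one near each group
    ```

    On this toy data the two centers converge to the means of the two obvious groups regardless of the random initialization.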

  2. Unlabeled Dataset

    • universe.roboflow.com
    zip
    Updated Jul 3, 2025
    + more versions
    Cite
    Hasan Berat (2025). Unlabeled Dataset [Dataset]. https://universe.roboflow.com/hasan-berat-c5eeq/unlabeled
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Hasan Berat
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Variables measured
    Face Bounding Boxes
    Description

    Unlabeled

    ## Overview
    
    Unlabeled is a dataset for object detection tasks - it contains Face annotations for 2,928 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC0 1.0 Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/).
    
  3. Brazilian Legal Proceedings

    • kaggle.com
    Updated May 14, 2021
    Cite
    Felipe Maia Polo (2021). Brazilian Legal Proceedings [Dataset]. https://www.kaggle.com/felipepolo/brazilian-legal-proceedings/code
    Explore at:
    Croissant
    Dataset updated
    May 14, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Felipe Maia Polo
    Description

    The Dataset

    These datasets were used while writing the following work:

    Polo, F. M., Ciochetti, I., and Bertolo, E. (2021). Predicting legal proceedings status: approaches based on sequential text data. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, pages 264–265.
    

    Please cite us if you use our datasets in your academic work:

    @inproceedings{polo2021predicting,
     title={Predicting legal proceedings status: approaches based on sequential text data},
     author={Polo, Felipe Maia and Ciochetti, Itamar and Bertolo, Emerson},
     booktitle={Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law},
     pages={264--265},
     year={2021}
    }
    

    More details below!

    Context

    Every legal proceeding in Brazil is one of three possible classes of status: (i) archived proceedings, (ii) active proceedings, and (iii) suspended proceedings. The three possible classes are given in a specific instant in time, which may be temporary or permanent. Moreover, they are decided by the courts to organize their workflow, which in Brazil may reach thousands of simultaneous cases per judge. Developing machine learning models to classify legal proceedings according to their status can assist public and private institutions in managing large portfolios of legal proceedings, providing gains in scale and efficiency.

    In this dataset, each proceeding is made up of a sequence of short texts called “motions” written in Portuguese by the courts’ administrative staff. The motions relate to the proceedings, but not necessarily to their legal status.

    Content

    Our data is composed of two datasets: one of ~3 × 10^6 unlabeled motions, and one of 6,449 legal proceedings, each with its own variable number of motions, labeled by lawyers. Among the labeled data, 47.14% is classified as archived (class 1), 45.23% as active (class 2), and 7.63% as suspended (class 3).

    The datasets we use are representative samples from the first (São Paulo) and third (Rio de Janeiro) largest state courts. State courts handle the widest variety of cases throughout Brazil and are responsible for 80% of the total number of lawsuits. These datasets therefore represent a very significant portion of the language and expressions used in Brazilian legal vocabulary.

    In the labeled dataset, the key "-1" denotes the most recent text, "-2" the second most recent, and so on.
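    A minimal sketch of handling the "-k" key convention described above, assuming a record of this general shape (the field names and motion texts are hypothetical placeholders):

    ```python
    # Hypothetical record shaped like the labeled dataset: motion texts keyed by
    # "-1" (most recent), "-2", ... plus a status label (1 archived, 2 active, 3 suspended).
    proceeding = {
        "-1": "Arquivamento determinado.",  # most recent motion
        "-2": "Intimação das partes.",
        "-3": "Juntada de petição.",
        "label": 1,
    }

    def motions_chronological(rec):
        """Return motion texts oldest-first, using the '-k' key convention."""
        keys = sorted((k for k in rec if k.startswith("-")), key=int)  # -3 < -2 < -1
        return [rec[k] for k in keys]

    print(motions_chronological(proceeding))  # oldest motion first
    ```

    Sorting the keys numerically puts the oldest motion ("-3") first, which is the natural order for sequence models over the motions.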

    Acknowledgements

    We would like to thank Ana Carolina Domingues Borges, Andrews Adriani Angeli, and Nathália Caroline Juarez Delgado from Tikal Tech for helping us to obtain the datasets. This work would not be possible without their efforts.

    Inspiration

    Can you develop good machine learning classifiers for text sequences? :)

  4. Dataset - Towards the Systematic Testing of Virtual Reality Programs

    • data.mendeley.com
    Updated Sep 16, 2021
    + more versions
    Cite
    Stevão Andrade (2021). Dataset - Towards the Systematic Testing of Virtual Reality Programs [Dataset]. http://doi.org/10.17632/4myfs585s9.2
    Explore at:
    Dataset updated
    Sep 16, 2021
    Authors
    Stevão Andrade
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data related to the experiment conducted in the paper Towards the Systematic Testing of Virtual Reality Programs.

    It contains an implementation of an approach for predicting defect proneness on unlabeled datasets: Average Clustering and Labeling (ACL).

    ACL models achieve good prediction performance, comparable to typical supervised learning models in terms of F-measure, and offer a viable choice for defect prediction on unlabeled datasets.

    This dataset also contains analyses related to code smells in C# repositories. Please check the paper for further information.

  5. Average dice coefficients of the few-supervised learning models using 2%,...

    • plos.figshare.com
    xls
    Updated Sep 6, 2024
    Cite
    Seung-Ah Lee; Hyun Su Kim; Ehwa Yang; Young Cheol Yoon; Ji Hyun Lee; Byung-Ok Choi; Jae-Hun Kim (2024). Average dice coefficients of the few-supervised learning models using 2%, 5%, and 10% of the labeled data, and semi-supervised learning models using 10% of the labeled data for training. [Dataset]. http://doi.org/10.1371/journal.pone.0310203.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 6, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Seung-Ah Lee; Hyun Su Kim; Ehwa Yang; Young Cheol Yoon; Ji Hyun Lee; Byung-Ok Choi; Jae-Hun Kim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Average dice coefficients of the few-supervised learning models using 2%, 5%, and 10% of the labeled data, and semi-supervised learning models using 10% of the labeled data for training.
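    For reference, the Dice coefficient reported in tables like this one is 2·|A∩B| / (|A| + |B|) for two binary segmentation masks; a minimal sketch on toy masks (the data is fabricated for illustration):

    ```python
    def dice(a, b):
        """Dice coefficient between two flat binary masks (sequences of 0/1)."""
        inter = sum(x and y for x, y in zip(a, b))  # |A ∩ B|
        total = sum(a) + sum(b)                     # |A| + |B|
        return 2 * inter / total if total else 1.0  # convention: both empty -> 1.0

    pred  = [1, 1, 0, 0, 1]
    truth = [1, 0, 0, 1, 1]
    print(dice(pred, truth))  # 2*2 / (3+3) = 0.666...
    ```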

  6. Sentiment140 tweet statistics.

    • plos.figshare.com
    xls
    Updated Apr 1, 2024
    Cite
    Maha Ijaz; Naveed Anwar; Mejdl Safran; Sultan Alfarhood; Tariq Sadad; Imran (2024). Sentiment140 tweet statistics. [Dataset]. http://doi.org/10.1371/journal.pone.0297028.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Apr 1, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Maha Ijaz; Naveed Anwar; Mejdl Safran; Sultan Alfarhood; Tariq Sadad; Imran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine learning techniques that rely on textual features or sentiment lexicons can lead to erroneous sentiment analysis. These techniques are particularly vulnerable to domain-related difficulties, especially when dealing with big data. In addition, labeling is time-consuming, and supervised machine learning algorithms often lack labeled data. Transfer learning can help save time and achieve high performance with smaller datasets in this field. To address this, we used a transfer learning-based Multi-Domain Sentiment Classification (MDSC) technique: we identify the sentiment polarity of text in an unlabeled target domain by learning from reviews in a labeled source domain. This research aims to evaluate the impact of domain adaptation and measure the extent to which transfer learning enhances sentiment analysis outcomes. We employed the transfer learning models BERT, RoBERTa, ELECTRA, and ULMFiT to improve performance in sentiment analysis, analyzed sentiment with various transformer models, and compared the performance of LSTM and CNN. The experiments are carried out on five publicly available sentiment analysis datasets, namely Hotel Reviews (HR), Movie Reviews (MR), Sentiment140 Tweets (ST), Citation Sentiment Corpus (CSC), and Bioinformatics Citation Corpus (BCC), to adapt to multiple target domains. The performance of the models, using transfer learning from diverse datasets, demonstrates how various factors influence the outputs.

  7. Data from: A General M-estimation Theory in Semi-Supervised Framework

    • tandf.figshare.com
    application/x-rar
    Updated Feb 13, 2024
    Cite
    Shanshan Song; Yuanyuan Lin; Yong Zhou (2024). A General M-estimation Theory in Semi-Supervised Framework [Dataset]. http://doi.org/10.6084/m9.figshare.22191384.v1
    Explore at:
    Available download formats: application/x-rar
    Dataset updated
    Feb 13, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Shanshan Song; Yuanyuan Lin; Yong Zhou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We study a class of general M-estimators in the semi-supervised setting, wherein the data are typically a combination of a relatively small labeled dataset and large amounts of unlabeled data. A new estimator, which efficiently uses the useful information contained in the unlabeled data, is proposed via a projection technique. We prove consistency and asymptotic normality, and provide an inference procedure based on K-fold cross-validation. The optimal weights are derived to balance the contributions of the labeled and unlabeled data. It is shown that the proposed method, by taking advantage of the unlabeled data, produces asymptotically more efficient estimation of the target parameters than the supervised counterpart. Supportive numerical evidence is shown in simulation studies. Applications are illustrated in analysis of the homeless data in Los Angeles. Supplementary materials for this article are available online.
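    For background (a generic textbook formulation, not the authors' exact estimator), an M-estimator minimizes an empirical criterion over the labeled pairs, and the semi-supervised idea is to improve it using the unlabeled covariates:

    ```latex
    % Generic M-estimator over n labeled pairs (X_i, Y_i) with loss m:
    \hat{\theta} = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} m(X_i, Y_i; \theta)
    % Schematic semi-supervised idea: augment the labeled criterion with a
    % correction term estimated from N >> n unlabeled covariates X_j,
    % weighted to balance the two samples (see the paper for the actual construction).
    ```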

  8. Synthetic and Unlabeled Dataset for Urban Seismic Event Detection (USED)

    • zenodo.org
    zip
    Updated Feb 28, 2024
    Cite
    Parth Sagar Hasabnis; Yunyue Elita Li; Yumin Zhao; Alex Nilot Enhedelihai; Gang Fang (2024). Synthetic and Unlabeled Dataset for Urban Seismic Event Detection (USED) [Dataset]. http://doi.org/10.5281/zenodo.10724593
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 28, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Parth Sagar Hasabnis; Yunyue Elita Li; Yumin Zhao; Alex Nilot Enhedelihai; Gang Fang
    License

    GNU GPL 3.0: https://www.gnu.org/licenses/gpl-3.0-standalone.html

    Description

    Contains Datasets for training and testing models for Urban Seismic Event Detection (USED).

    1. Strong Dataset: Contains synthetic data to be used for supervised learning
    2. Unlabeled Dataset: Contains unlabeled data to be used for semi-supervised (or unsupervised) learning
    3. Test Synth: Synthetic dataset to evaluate models
    4. Test Real: Small real dataset to evaluate models

    The data is in SAC format, with JSON labels. The ObsPy library in Python can be used to read this data.
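    A minimal sketch of working with the label files, assuming JSON content of this general shape (the keys and file names are hypothetical placeholders, not the dataset's actual schema); the commented lines show how ObsPy would read the SAC waveforms:

    ```python
    import json

    # Hypothetical JSON label paired with a SAC waveform file.
    label_text = '{"event": "traffic", "start": "2023-01-01T00:00:00Z"}'
    label = json.loads(label_text)
    print(label["event"])  # traffic

    # For the SAC waveforms themselves, ObsPy can be used, e.g.:
    #   from obspy import read
    #   stream = read("example_waveform.sac")  # returns an ObsPy Stream
    #   trace = stream[0]
    #   print(trace.stats.sampling_rate, trace.data.shape)
    ```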

  9. ‘BLE RSSI Dataset for Indoor localization’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 21, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘BLE RSSI Dataset for Indoor localization’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-ble-rssi-dataset-for-indoor-localization-f7ec/641e5a0f/?iid=005-634&v=presentation
    Explore at:
    Dataset updated
    Nov 21, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘BLE RSSI Dataset for Indoor localization’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mehdimka/ble-rssi-dataset on 20 November 2021.

    --- Dataset description provided by original source is as follows ---

    Content

    The dataset was created using the RSSI readings of an array of 13 iBeacons on the first floor of Waldo Library, Western Michigan University. Data was collected using an iPhone 6S. The dataset contains two sub-datasets: a labeled dataset (1420 instances) and an unlabeled dataset (5191 instances). The recording was performed during the operational hours of the library. For the labeled dataset, the input data contains the location (label column) and a timestamp, followed by the RSSI readings of the 13 iBeacons. RSSI measurements are negative values; larger RSSI values indicate closer proximity to a given iBeacon (e.g., an RSSI of -65 represents a closer distance than an RSSI of -85). For out-of-range iBeacons, the RSSI is indicated by -200. The locations corresponding to the RSSI readings are combined in one column consisting of a letter for the column and a number for the row of the position. The following figure depicts the layout of the iBeacons as well as the arrangement of locations.

    iBeacon layout image: https://www.kaggle.com/mehdimka/ble-rssi-dataset/downloads/iBeacon_Layout.jpg

    Attribute Information

    • location: The location of receiving RSSIs from ibeacons b3001 to b3013; symbolic values showing the column and row of the location on the map (e.g., A01 stands for column A, row 1).
    • date: Datetime in the format of ‘d-m-yyyy hh:mm:ss’
    • b3001 - b3013: RSSI readings corresponding to the iBeacons; numeric, integers only.
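    A minimal sketch of handling these attributes, assuming the -200 out-of-range sentinel and the "A01"-style location labels described above (the sample readings are fabricated):

    ```python
    OUT_OF_RANGE = -200  # sentinel RSSI for out-of-range iBeacons

    def parse_location(loc):
        """Split a location label like 'A01' into (column letter, row number)."""
        return loc[0], int(loc[1:])

    def in_range(rssis):
        """Keep only the beacons that were actually heard."""
        return {b: r for b, r in rssis.items() if r != OUT_OF_RANGE}

    row = {"b3001": -65, "b3002": -200, "b3003": -85}
    print(parse_location("A01"))  # ('A', 1)
    print(in_range(row))          # {'b3001': -65, 'b3003': -85}
    ```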

    Acknowledgements

    Provider: Mehdi Mohammadi and Ala Al-Fuqaha, {mehdi.mohammadi, ala-alfuqaha}@wmich.edu, Department of Computer Science, Western Michigan University

    Citation Request:

    M. Mohammadi, A. Al-Fuqaha, M. Guizani, J. Oh, “Semi-supervised Deep Reinforcement Learning in Support of IoT and Smart City Services,” IEEE Internet of Things Journal, Vol. PP, No. 99, 2017.

    Inspiration

    How can unlabeled data help build an improved learning system? How can a GAN model synthesize viable paths based on the small amount of labeled data and the larger set of unlabeled data?

    --- Original source retains full ownership of the source dataset ---

  10. Square dataset - Dataset - LDM

    • service.tib.eu
    Updated Dec 2, 2024
    Cite
    (2024). Square dataset - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/square-dataset
    Explore at:
    Dataset updated
    Dec 2, 2024
    Description

    The dataset used in the paper is a wide domain image dataset, and the authors propose a weakly semi-supervised method for disentangling using both labeled and unlabeled data.

  11. CAS-PEAL-R1 dataset - Dataset - LDM

    • service.tib.eu
    Updated Dec 2, 2024
    Cite
    (2024). CAS-PEAL-R1 dataset - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/cas-peal-r1-dataset
    Explore at:
    Dataset updated
    Dec 2, 2024
    Description

    The dataset used in the paper is a wide domain image dataset, and the authors propose a weakly semi-supervised method for disentangling using both labeled and unlabeled data.

  12. SemiEvol

    • huggingface.co
    Updated Oct 22, 2024
    Cite
    junyu (2024). SemiEvol [Dataset]. https://huggingface.co/datasets/luojunyu/SemiEvol
    Explore at:
    Croissant
    Dataset updated
    Oct 22, 2024
    Authors
    junyu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The SemiEvol dataset is part of the broader work on semi-supervised fine-tuning for Large Language Models (LLMs). The dataset includes labeled and unlabeled data splits designed to enhance the reasoning capabilities of LLMs through a bi-level knowledge propagation and selection framework, as proposed in the paper SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation.

      Dataset Details

      Dataset Sources [optional]… See the full description on the dataset page: https://huggingface.co/datasets/luojunyu/SemiEvol.
    
  13. Amos: A large-scale abdominal multi-organ benchmark for versatile medical...

    • zenodo.org
    csv, zip
    Updated Nov 7, 2022
    + more versions
    Cite
    Yuanfeng Ji (2022). Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation (Unlabeled Data Part II) [Dataset]. http://doi.org/10.5281/zenodo.7295661
    Explore at:
    Available download formats: zip, csv
    Dataset updated
    Nov 7, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yuanfeng Ji
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate these limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and a test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. The paper can be found at https://arxiv.org/pdf/2206.08023.pdf

    In addition to providing the labeled 600 CT and MRI scans, we expect to provide 2000 CT and 1200 MRI scans without labels to support more learning tasks (semi-supervised, unsupervised, domain adaptation, ...). The link can be found in:

    If you find this dataset useful for your research, please cite:

    @article{ji2022amos,
     title={AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation},
     author={Ji, Yuanfeng and Bai, Haotian and Yang, Jie and Ge, Chongjian and Zhu, Ye and Zhang, Ruimao and Li, Zhen and Zhang, Lingyan and Ma, Wanling and Wan, Xiang and others},
     journal={arXiv preprint arXiv:2206.08023},
     year={2022}
    
    
  14. Replication Data for: Improving Probabilistic Models in Text Classification...

    • dataverse.harvard.edu
    Updated Aug 6, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mitchell Bosley; Saki Kuzushima; Ted Enamorado; Yuki Shiraito (2024). Replication Data for: Improving Probabilistic Models in Text Classification via Active Learning [Dataset]. http://doi.org/10.7910/DVN/7DOXQY
    Explore at:
    Croissant
    Dataset updated
    Aug 6, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Mitchell Bosley; Saki Kuzushima; Ted Enamorado; Yuki Shiraito
    License

    Custom license: https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/7DOXQY

    Description

    Social scientists often classify text documents to use the resulting labels as an outcome or a predictor in empirical research. Automated text classification has become a standard tool, since it requires less human coding. However, scholars still need many human-labeled documents for training. To reduce labeling costs, we propose a new algorithm for text classification that combines a probabilistic model with active learning. The probabilistic model uses both labeled and unlabeled data, and active learning concentrates labeling efforts on difficult documents to classify. Our validation study shows that with few labeled data the classification performance of our algorithm is comparable to state-of-the-art methods at a fraction of the computational cost. We replicate the results of two published articles with only a small fraction of the original labeled data used in those studies, and provide open-source software to implement our method.
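    As an illustration of the active-learning idea described above (plain entropy-based uncertainty sampling, not the authors' specific algorithm; the predicted probabilities are fabricated):

    ```python
    import math

    def entropy(probs):
        """Shannon entropy of a predicted class distribution."""
        return -sum(p * math.log(p) for p in probs if p > 0)

    def most_uncertain(pred_probs):
        """Index of the unlabeled document whose prediction is most uncertain."""
        return max(range(len(pred_probs)), key=lambda i: entropy(pred_probs[i]))

    # Model's class probabilities for three unlabeled documents.
    preds = [[0.95, 0.05], [0.55, 0.45], [0.80, 0.20]]
    print(most_uncertain(preds))  # 1 -> the near-tie is sent to a human labeler next
    ```

    Concentrating labeling effort on the highest-entropy documents is what lets such methods match supervised baselines with far fewer labels.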

  15. Number of images used for the training and testing of the models with...

    • plos.figshare.com
    xls
    Updated Sep 6, 2024
    Cite
    Seung-Ah Lee; Hyun Su Kim; Ehwa Yang; Young Cheol Yoon; Ji Hyun Lee; Byung-Ok Choi; Jae-Hun Kim (2024). Number of images used for the training and testing of the models with different labeling strategies. [Dataset]. http://doi.org/10.1371/journal.pone.0310203.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 6, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Seung-Ah Lee; Hyun Su Kim; Ehwa Yang; Young Cheol Yoon; Ji Hyun Lee; Byung-Ok Choi; Jae-Hun Kim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Number of images used for the training and testing of the models with different labeling strategies.

  16. Data used in Machine learning reveals the waggle drift's role in the honey...

    • zenodo.org
    • data.europa.eu
    csv, zip
    Updated May 18, 2023
    + more versions
    Cite
    David M Dormagen; Benjamin Wild; Fernando Wario; Tim Landgraf (2023). Data used in Machine learning reveals the waggle drift's role in the honey bee dance communication system [Dataset]. http://doi.org/10.5281/zenodo.7928121
    Explore at:
    Available download formats: csv, zip
    Dataset updated
    May 18, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    David M Dormagen; Benjamin Wild; Fernando Wario; Tim Landgraf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and metadata used in "Machine learning reveals the waggle drift’s role in the honey bee dance communication system"

    All timestamps are given in ISO 8601 format.

    The following files are included:

    Berlin2019_waggle_phases.csv, Berlin2021_waggle_phases.csv

    Automatic individual detections of waggle phases during our recording periods in 2019 and 2021.

    • timestamp: Date and time of the detection.

    • cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).

    • x_median, y_median: Median position of the bee during the waggle phase (for 2019 given in millimeters after applying a homography, for 2021 in the original image coordinates).

    • waggle_angle: Body orientation of the bee during the waggle phase in radians (0: oriented to the right, PI / 4: oriented upwards).
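    A minimal sketch of reading rows in this layout with the Python standard library (the two sample rows are fabricated, not taken from the dataset):

    ```python
    import csv
    import io
    import math
    from datetime import datetime

    # Hypothetical two-row sample in the Berlin2019_waggle_phases.csv layout
    # described above (ISO 8601 timestamps, angles in radians).
    sample = io.StringIO(
        "timestamp,cam_id,x_median,y_median,waggle_angle\n"
        "2019-08-01T12:00:00+00:00,0,123.4,56.7,0.0\n"
        "2019-08-01T12:00:01+00:00,1,124.0,57.1,0.7853981633974483\n"
    )

    for rec in csv.DictReader(sample):
        ts = datetime.fromisoformat(rec["timestamp"])        # ISO 8601 parses directly
        angle_deg = math.degrees(float(rec["waggle_angle"]))  # radians -> degrees
        print(ts.hour, rec["cam_id"], round(angle_deg, 1))
    ```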

    Berlin2019_dances.csv

    Automatic detections of dance behavior during our recording period in 2019.

    • dancer_id: Unique ID of the individual bee.

    • dance_id: Unique ID of the dance.

    • ts_from, ts_to: Date and time of the beginning and end of the dance.

    • cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).

    • median_x, median_y: Median position of the individual during the dance.

    • feeder_cam_id: ID of the feeder that the bee was detected at prior to the dance.

    Berlin2019_followers.csv

    Automatic detections of attendance and following behavior, corresponding to the dances in Berlin2019_dances.csv.

    • dance_id: Unique ID of the dance being attended or followed.

    • follower_id: Unique ID of the individual attending or following the dance.

    • ts_from, ts_to: Date and time of the beginning and end of the interaction.

    • label: “attendance” or “follower”

    • cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).

    Berlin2019_dances_with_manually_verified_times.csv

    A sample of dances from Berlin2019_dances.csv where the exact timestamps have been manually verified to correspond to the beginning of the first and last waggle phase down to a precision of ca. 166 ms (video material was recorded at 6 FPS).

    • dance_id: Unique ID of the dance.

    • dancer_id: Unique ID of the dancing individual.

    • cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).

    • feeder_cam_id: ID of the feeder that the bee was detected at prior to the dance.

    • dance_start, dance_end: Manually verified date and times of the beginning and end of the dance.

    Berlin2019_dance_classifier_labels.csv

    Manually annotated waggle phases or following behavior for our recording season in 2019 that was used to train the dancing and following classifier. Can be merged with the supplied individual detections.

    • timestamp: Timestamp of the individual frame the behavior was observed in.

    • frame_id: Unique ID of the video frame the behavior was observed in.

    • bee_id: Unique ID of the individual bee.

    • label: One of “nothing”, “waggle”, “follower”

    Berlin2019_dance_classifier_unlabeled.csv

    Additional unlabeled samples of timestamp and individual ID with the same format as Berlin2019_dance_classifier_labels.csv, but without a label. The data points have been sampled close to detections of our waggle phase classifier, so behaviors related to the waggle dance are likely overrepresented in that sample.

    Berlin2021_waggle_phase_classifier_labels.csv

    Manually annotated detections of our waggle phase detector (bb_wdd2) that were used to train the neural network filter (bb_wdd_filter) for the 2021 data.

    • detection_id: Unique ID of the waggle phase.

    • label: One of “waggle”, “activating”, “ventilating”, “trembling”, “other”, where “waggle” denotes a waggle phase, “activating” the shaking signal, and “ventilating” a bee fanning her wings. “trembling” denotes a tremble dance, but the distinction from the “other” class was often unclear, so “trembling” was merged into “other” for training.

    • orientation: The body orientation of the bee that triggered the detection in radians (0: facing to the right, PI /4: facing up).

    • metadata_path: Path to the individual detection in the same directory structure as created by the waggle dance detector.

    Berlin2021_waggle_phase_classifier_ground_truth.zip

    The output of the waggle dance detector (bb_wdd2) that corresponds to Berlin2021_waggle_phase_classifier_labels.csv and is used for training. The archive includes a directory structure as output by the bb_wdd2 and each directory includes the original image sequence that triggered the detection in an archive and the corresponding metadata. The training code supplied in bb_wdd_filter directly works with this directory structure.

    Berlin2019_tracks.zip

    Detections and tracks from the recording season in 2019 as produced by our tracking system. As the full data is several terabytes in size, we include the subset of our data here that is relevant for our publication which comprises over 46 million detections. We included tracks for all detected behaviors (dancing, following, attending) including one minute before and after the behavior. We also included all tracks that correspond to the labeled and unlabeled data that was used to train the dance classifier including 30 seconds before and after the data used for training.
    We grouped the exported data by date to make the handling easier, but to efficiently work with the data, we recommend importing it into an indexable database.

    The individual files contain the following columns:

    • cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).

    • timestamp: Date and time of the detection.

    • frame_id: Unique ID of the video frame of the recording from which the detection was extracted.

    • track_id: Unique ID of an individual track (short motion path from one individual). For longer tracks, the detections can be linked based on the bee_id.

    • bee_id: Unique ID of the individual bee.

    • bee_id_confidence: Confidence between 0 and 1 that the bee_id is correct as output by our tracking system.

    • x_pos_hive, y_pos_hive: Spatial position of the bee in the hive on the side indicated by cam_id. Given in millimeters after applying a homography on the video material.

    • orientation_hive: Orientation of the bee’s thorax in the hive in radians (0: oriented to the right, PI / 2: oriented upwards).
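    The “indexable database” suggestion above can be sketched with SQLite: load the per-date files into one table and index bee_id and timestamp for fast track lookups. The table layout follows the column list above; the sample row values are illustrative, not from the dataset.

```python
import sqlite3

# Minimal sketch: an in-memory SQLite table mirroring the exported columns,
# with an index on (bee_id, timestamp) to link detections into longer tracks.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE detections (
        cam_id INTEGER, timestamp TEXT, frame_id INTEGER, track_id INTEGER,
        bee_id INTEGER, bee_id_confidence REAL,
        x_pos_hive REAL, y_pos_hive REAL, orientation_hive REAL
    )
""")
row = (0, "2019-08-01T12:00:00", 1, 1, 42, 0.98, 10.5, 20.0, 0.0)  # illustrative
conn.execute("INSERT INTO detections VALUES (?,?,?,?,?,?,?,?,?)", row)
conn.execute("CREATE INDEX idx_bee_time ON detections (bee_id, timestamp)")
n = conn.execute("SELECT COUNT(*) FROM detections WHERE bee_id = 42").fetchone()[0]
print(n)  # 1
```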

    Berlin2019_feeder_experiment_log.csv

    Experiment log for our feeder experiments in 2019.

    • date: Date given in the format year-month-day.

    • feeder_cam_id: Numeric ID of the feeder.

    • coordinates: Longitude and latitude of the feeder. For feeders 1 and 2 this is only given once and held constant. Feeder 3 had varying locations.

    • time_opened, time_closed: Date and time when the feeder was set up or closed again.
    • sucrose_solution: Concentration of the sucrose solution, given as sugar:water (by weight). On days when feeder 3 was open, the other two feeders offered water without sugar.
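    Assuming the sucrose_solution column stores ratio strings such as "1:1" or "2:1" (an assumption about the exact encoding), the sugar mass fraction of the solution can be recovered like this:

```python
def sugar_mass_fraction(ratio: str) -> float:
    """Convert a sugar:water weight ratio string (e.g. "2:1") to the
    sugar mass fraction of the solution: sugar / (sugar + water)."""
    sugar, water = (float(p) for p in ratio.split(":"))
    return sugar / (sugar + water)

print(sugar_mass_fraction("1:1"))  # 0.5
```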

    Software used to acquire and analyze the data:

  17. O

    STL-10

    • opendatalab.com
    zip
    Updated Aug 24, 2022
    + more versions
    Cite
    University of Michigan (2022). STL-10 [Dataset]. https://opendatalab.com/OpenDataLab/STL-10
    Explore at:
    Available download formats: zip (5978439104 bytes)
    Dataset updated
    Aug 24, 2022
    Dataset provided by
    Stanford University
    University of Michigan
    Description
    Inspired by the CIFAR-10 dataset, STL-10 is an image recognition dataset for developing unsupervised feature learning and deep learning algorithms. Each class has fewer labeled training examples than CIFAR-10, and a large set of unlabeled samples is provided for learning image models prior to supervised training. The primary challenge is to make use of the unlabeled data. With its higher resolution (96x96), the dataset is expected to be a more challenging benchmark for developing scalable unsupervised learning models.
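    A quick sanity check of the description above, using the published split sizes (10 classes, 500 labeled training images per class, 100,000 unlabeled images at 96x96, versus CIFAR-10's 5,000 labeled training images per class at 32x32):

```python
# Back-of-envelope comparison of labeled training data: STL-10 vs CIFAR-10.
stl10 = {"classes": 10, "train_per_class": 500, "unlabeled": 100_000, "res": 96}
cifar10 = {"classes": 10, "train_per_class": 5_000, "res": 32}

labeled_train_stl10 = stl10["classes"] * stl10["train_per_class"]      # 5,000
labeled_train_cifar = cifar10["classes"] * cifar10["train_per_class"]  # 50,000
print(labeled_train_cifar // labeled_train_stl10)  # 10: CIFAR-10 has 10x the labels
```

    The splits themselves can be loaded with standard tooling, e.g. `torchvision.datasets.STL10(root, split="unlabeled")`.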
    
  18. D

    Self-Supervised Learning Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 23, 2024
    Cite
    Dataintelo (2024). Self-Supervised Learning Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-self-supervised-learning-market
    Explore at:
    Available download formats: pdf, pptx, csv
    Dataset updated
    Sep 23, 2024
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Self-Supervised Learning Market Outlook



    As of 2023, the global self-supervised learning market size is valued at approximately USD 1.5 billion and is expected to escalate to around USD 10.8 billion by 2032, reflecting a compound annual growth rate (CAGR) of 24.1% during the forecast period. This robust growth is driven by the increasing demand for advanced AI models that can learn from large volumes of unlabeled data, significantly reducing the dependency on labeled datasets, thereby making AI training more cost-effective and scalable.
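    The quoted growth rate can be checked directly from the endpoints (USD 1.5 billion in 2023 to USD 10.8 billion in 2032 spans 9 years of growth); the result lands near the report's 24.1%, with the small gap plausibly due to rounding of the endpoint figures:

```python
# CAGR = (end / start) ** (1 / years) - 1, using the figures quoted above.
start, end, years = 1.5, 10.8, 2032 - 2023
cagr = (end / start) ** (1 / years) - 1
print(round(cagr * 100, 1))  # → 24.5
```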



    The growth of the self-supervised learning market is fueled by several factors, one of which is the exponential increase in data generation. With the proliferation of digital devices, IoT technologies, and social media platforms, there is an unprecedented amount of data being created every second. Self-supervised learning models leverage this vast amount of unlabeled data to train themselves, making them particularly valuable in industries where data labeling is time-consuming and expensive. This capability is especially pertinent in fields like healthcare, finance, and retail, where the rapid analysis of extensive datasets can lead to significant advancements in predictive analytics and customer insights.



    Another critical driver is the advancement in computational technologies that support more sophisticated machine learning models. The development of more powerful GPUs and cloud-based AI platforms has enabled the efficient training and deployment of self-supervised learning models. These technological advancements not only reduce the time required for training but also enhance the accuracy and performance of the models. Furthermore, the integration of self-supervised learning with other AI paradigms such as reinforcement learning and deep learning is opening new avenues for research and application, further propelling market growth.



    The increasing adoption of AI across various industries is also a significant growth factor. Businesses are increasingly recognizing the potential of AI to optimize operations, enhance customer experiences, and drive innovation. Self-supervised learning, with its ability to make sense of large, unstructured datasets, is becoming a cornerstone of AI strategies across sectors. For instance, in the healthcare sector, self-supervised learning is being used to develop predictive models for disease diagnosis and treatment planning, while in the finance sector, it aids in fraud detection and risk management.



    Regionally, North America is expected to dominate the self-supervised learning market, owing to the presence of leading technology companies and extensive R&D activities in AI. However, the Asia Pacific region is anticipated to witness the fastest growth during the forecast period, driven by rapid digital transformation, increasing investment in AI technologies, and supportive government initiatives. Europe also presents a significant market opportunity, with a strong focus on AI research and development, particularly in countries like Germany, the UK, and France.



    Component Analysis



    The self-supervised learning market is segmented by component into software, hardware, and services. The software segment is expected to hold the largest market share, driven by the development and adoption of advanced AI algorithms and platforms. These software solutions are designed to leverage the vast amounts of unlabeled data available, making them highly valuable for various applications such as natural language processing, computer vision, and predictive analytics. Furthermore, continuous advancements in software capabilities, such as improved model training techniques and enhanced data preprocessing tools, are expected to fuel the growth of this segment.



    The hardware segment, while smaller in comparison to software, is crucial for the efficient deployment of self-supervised learning models. This includes high-performance computing systems, GPUs, and specialized AI accelerators that provide the necessary computational power to train and run complex AI models. Innovations in hardware technology, such as the development of more energy-efficient and powerful processing units, are expected to drive growth in this segment. Additionally, the increasing adoption of edge computing devices that can perform AI tasks locally, thereby reducing latency and bandwidth usage, is also contributing to the expansion of the hardware segment.



    Services are another vital component of the self-supervised learning market. This segment encompasses various professional services such as consulting, int

  19. h

    IITKGP_Fence_dataset

    • huggingface.co
    Updated Sep 20, 2024
    Cite
    Moushumi Medhi (2024). IITKGP_Fence_dataset [Dataset]. https://huggingface.co/datasets/NeuroVizv0yaZ3R/IITKGP_Fence_dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 20, 2024
    Authors
    Moushumi Medhi
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    IITKGP_Fence dataset

      Overview
    

    The IITKGP_Fence dataset is designed for tasks related to fence-like occlusion detection, defocus blur, depth mapping, and object segmentation. The captured data varies in scene composition, background defocus, and object occlusions. The dataset comprises both labeled and unlabeled data, as well as additional video and RGB-D data, and contains ground truth occlusion masks (GT) for the corresponding images. We created the ground truth… See the full description on the dataset page: https://huggingface.co/datasets/NeuroVizv0yaZ3R/IITKGP_Fence_dataset.

  20. STL10-Labeled Image Recognition Dataset

    • kaggle.com
    Updated Aug 6, 2025
    Cite
    Semih Yagli (2025). STL10-Labeled Image Recognition Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/12688697
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Semih Yagli
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This public dataset contains labels for the unlabeled 100,000 pictures in the STL-10 dataset.

    The dataset is human labeled with AI aid through Etiqueta, the one and only gamified mobile data labeling application. stl10.py is a Python script written by Martin Tutek to download the complete STL10 dataset. labels.json contains labels for the 100,000 previously unlabeled images in the STL10 dataset. legend.json is a mapping of the labels used. stats.ipynb presents a few statistics regarding the 100,000 newly labeled images.
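    A hypothetical sketch of combining the two files: the exact schemas of labels.json and legend.json are not documented above, so this assumes labels.json maps image index to label id and legend.json maps label id to class name (adjust to the real files; the inline JSON strings stand in for `json.load(open(...))`):

```python
import json

# Stand-ins for json.load(open("labels.json")) / json.load(open("legend.json")).
# Schemas are assumed, not taken from the dataset documentation.
labels = json.loads('{"0": 3, "1": 7}')
legend = json.loads('{"3": "cat", "7": "ship"}')

# Resolve each image index to its human-readable class name.
class_names = {int(i): legend[str(lab)] for i, lab in labels.items()}
print(class_names)  # {0: 'cat', 1: 'ship'}
```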

    If you use this dataset in your research please cite the following:

    @techreport{yagli2025etiqueta,
     author = {Semih Yagli},
     title = {Etiqueta: AI-Aided, Gamified Data Labeling to Label and Segment Data},
     year = {2025},
     number = {TR-2025-0001},
     address = {NJ, USA},
 month = apr,
     url = {https://www.aidatalabel.com/technical_reports/aidatalabel_tr_2025_0001.pdf},
     institution = {AI Data Label},
    }
    
    @inproceedings{coates2011analysis,
      title = {An analysis of single-layer networks in unsupervised feature learning},
      author = {Coates, Adam and Ng, Andrew and Lee, Honglak},
      booktitle = {Proceedings of the fourteenth international conference on artificial intelligence and statistics},
      pages = {215--223},
      year = {2011},
      organization = {JMLR Workshop and Conference Proceedings}
    }
    

    Note: The dataset is imported to Kaggle from: https://github.com/semihyagli/STL10-Labeled See also: https://github.com/semihyagli/STL10_Segmentation

    If you have comments or questions about Etiqueta or about this dataset, please reach out to us at contact@aidatalabel.com
