100+ datasets found
  1. Data from: Pseudo-Label Generation for Multi-Label Text Classification

    • catalog.data.gov
    • datasets.ai
    • +2more
    Updated Dec 6, 2023
    + more versions
    Cite
    Dashlink (2023). Pseudo-Label Generation for Multi-Label Text Classification [Dataset]. https://catalog.data.gov/dataset/pseudo-label-generation-for-multi-label-text-classification
    Dataset updated: Dec 6, 2023
    Dataset provided by: Dashlink
    Description

    With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. To handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. To build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This property is not unique to text data, as it can be found in non-text (e.g., numeric) data as well; however, it is most prevalent in text data. It also puts the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us perform better text classification, and under what kind of circumstances. The high and sparse dimensionality of text data has also been considered during classification. Although we propose and evaluate a text classification technique here, our main focus is on handling the multi-labelity of text data while utilizing the correlation among the multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide insight into how multi-labelity is handled in our classification process and show the effectiveness of our approach.
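    The core idea described here, generating pseudo-labels as combinations of existing class labels, can be sketched generically (a label-powerset style construction for illustration only; this is not the authors' pseudo-LSC implementation, which additionally performs subspace clustering of high-dimensional sparse text features):

    ```python
    import numpy as np

    # Toy multi-label indicator matrix: rows are documents, columns are class labels.
    Y = np.array([
        [1, 0, 1],   # doc 0: labels A and C
        [1, 0, 1],   # doc 1: labels A and C
        [0, 1, 0],   # doc 2: label B
        [1, 1, 0],   # doc 3: labels A and B
    ])

    # Each distinct observed combination of labels becomes one pseudo-label,
    # turning the multi-label problem into a single-label one.
    combos, pseudo_labels = np.unique(Y, axis=0, return_inverse=True)

    print(pseudo_labels)              # [1 1 0 2]: one pseudo-label id per document
    for i, combo in enumerate(combos):
        print(f"pseudo-label {i} = label combination {combo}")
    ```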

  2. Data from: Towards Automatic Labeling of Exception Handling Bugs: A Case...

    • figshare.com
    zip
    Updated Apr 29, 2024
    Cite
    Renan Vieira (2024). Towards Automatic Labeling of Exception Handling Bugs: A Case Study of 10 Years Bug-Fixing in Apache Hadoop [Dataset]. http://doi.org/10.6084/m9.figshare.22735124.v2
    Available download formats: zip
    Dataset updated: Apr 29, 2024
    Dataset provided by: figshare
    Authors: Renan Vieira
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    Context: Exception handling (EH) bugs stem from incorrect usage of exception handling mechanisms (EHMs) and often incur severe consequences (e.g., system downtime, data loss, and security risk). Tracking EH bugs is particularly relevant for contemporary systems (e.g., cloud- and AI-based systems), in which the software's sophisticated logic is an additional threat to the correct use of the EHM. On top of that, bug reporters can seldom tag EH bugs, since doing so may require encompassing knowledge of the software's EH strategy. Surprisingly, to the best of our knowledge, there is no automated procedure to identify EH bugs from report descriptions.

    Objective: First, we aim to evaluate the extent to which Natural Language Processing (NLP) and Machine Learning (ML) can be used to reliably label EH bugs using the text fields from bug reports (e.g., summary, description, and comments). Second, we aim to provide a reliably labeled dataset that the community can use in future endeavors. Overall, we expect our work to raise the community's awareness regarding the importance of EH bugs.

    Method: We manually analyzed 4,516 bug reports from the four main components of Apache's Hadoop project, out of which we labeled ~20% (943) as EH bugs. We also labeled 2,584 non-EH bugs by analyzing their bug-fixing code, creating a dataset composed of 7,100 bug reports. Then, we used word embedding techniques (Bag-of-Words and TF-IDF) to summarize the textual fields of bug reports. Subsequently, we used these embeddings to fit five classes of ML methods and evaluated them on unseen data. We also evaluated a pre-trained transformer-based model using the complete textual fields. We further evaluated whether considering only EH keywords is enough to achieve high predictive performance.

    Results: Our results show that using a pre-trained DistilBERT with a linear layer trained with our proposed dataset can reasonably label EH bugs, achieving ROC-AUC scores of up to 0.88. The combination of traditional NLP and ML techniques achieved ROC-AUC scores of up to 0.74 and recall up to 0.56. As a sanity check, we also evaluated methods using embeddings extracted solely from keywords. Considering ROC-AUC as the primary concern, for the majority of ML methods tested, the analysis suggests that keywords alone are not sufficient to characterize reports of EH bugs, although this can change based on other metrics (such as recall and precision) or ML methods (e.g., Random Forest).

    Conclusions: To the best of our knowledge, this is the first study addressing the problem of automatic labeling of EH bugs. Based on our results, we conclude that the use of ML techniques, especially transformer-based models, sounds promising for automating the task of labeling EH bugs. Overall, we hope (i) that our work will contribute towards raising awareness around EH bugs; and (ii) that our (publicly available) dataset will serve as a benchmarking dataset, paving the way for follow-up works. Additionally, our findings can be used to build tools that help maintainers flesh out EH bugs during the triage process.
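    As a rough illustration of the classical NLP + ML pipeline described above (TF-IDF features fed to a standard classifier and scored with ROC-AUC), here is a minimal sketch; the file name, column names, and classifier choice are assumptions for illustration, not the authors' exact setup:

    ```python
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Hypothetical columns: 'text' (summary + description + comments) and 'is_eh_bug' (0/1).
    df = pd.read_csv("hadoop_bug_reports.csv")  # assumed file name
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["is_eh_bug"], test_size=0.2,
        stratify=df["is_eh_bug"], random_state=42)

    vec = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(X_train), y_train)

    probs = clf.predict_proba(vec.transform(X_test))[:, 1]
    print("ROC-AUC:", roc_auc_score(y_test, probs))
    ```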

  3. Data Labeling Solution and Services Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 7, 2025
    + more versions
    Cite
    AMA Research & Media LLP (2025). Data Labeling Solution and Services Report [Dataset]. https://www.archivemarketresearch.com/reports/data-labeling-solution-and-services-52811
    Available download formats: doc, ppt, pdf
    Dataset updated: Mar 7, 2025
    Dataset provided by: AMA Research & Media LLP
    License: https://www.archivemarketresearch.com/privacy-policy
    Time period covered: 2025 - 2033
    Area covered: Global
    Variables measured: Market Size
    Description

    The Data Labeling Solutions and Services market is experiencing robust growth, driven by the escalating demand for high-quality training data in the artificial intelligence (AI) and machine learning (ML) sectors. The market, estimated at $15 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching approximately $75 billion by 2033. This expansion is fueled by several key factors. Firstly, the increasing adoption of AI across diverse industries, including automotive, healthcare, and finance, necessitates vast amounts of accurately labeled data for model training and improvement. Secondly, advancements in deep learning algorithms and the emergence of sophisticated data annotation tools are streamlining the labeling process, boosting efficiency and reducing costs. Finally, the growing availability of diverse data sources, coupled with the rise of specialized data labeling companies, is further contributing to market growth. Despite these positive trends, the market faces some challenges. The high cost associated with data annotation, particularly for complex datasets requiring specialized expertise, can be a barrier for smaller businesses. Ensuring data quality and consistency across large-scale projects remains a critical concern, necessitating robust quality control measures. Furthermore, addressing data privacy and security issues is essential to maintain ethical standards and build trust within the market. The market segmentation by type (text, image/video, audio) and application (automotive, government, healthcare, financial services, etc.) presents significant opportunities for specialized service providers catering to niche needs. Competition is expected to intensify as new players enter the market, focusing on innovative solutions and specialized services.

  4. Lego Part Label Classification Dataset

    • universe.roboflow.com
    zip
    Updated Apr 25, 2023
    Cite
    PDP lego classification (2023). Lego Part Label Classification Dataset [Dataset]. https://universe.roboflow.com/pdp-lego-classification/lego-part-label-classification
    Available download formats: zip
    Dataset updated: Apr 25, 2023
    Dataset provided by: The Lego Group (http://lego.com/)
    Authors: PDP lego classification
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
    Variables measured: Lego
    Description

    Lego Part Label Classification

    ## Overview
    
    Lego Part Label Classification is a dataset for classification tasks - it contains Lego annotations for 301 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  5. Multi-Label Classification Dataset

    • kaggle.com
    Updated Jan 28, 2021
    Cite
    Shivanand (2021). Multi-Label Classification Dataset [Dataset]. https://www.kaggle.com/datasets/shivanandmn/multilabel-classification-dataset/data
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated: Jan 28, 2021
    Dataset provided by: Kaggle (http://kaggle.com/)
    Authors: Shivanand
    Description

    Context

    NLP: Multi-label Classification Dataset.

    Content

    The dataset contains 6 different labels (Computer Science, Physics, Mathematics, Statistics, Quantitative Biology, Quantitative Finance) used to classify research papers based on their abstract and title. A value of 1 in a label column indicates that the label applies to that paper; a paper can have several labels set to 1.
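    A minimal sketch of how such a multi-hot label layout might be read (file and column names are assumed from the description above and may differ in the actual Kaggle files):

    ```python
    import pandas as pd

    df = pd.read_csv("train.csv")  # assumed file name
    label_cols = ["Computer Science", "Physics", "Mathematics",
                  "Statistics", "Quantitative Biology", "Quantitative Finance"]

    texts = df["TITLE"] + ". " + df["ABSTRACT"]   # assumed text columns
    Y = df[label_cols].values                     # multi-hot matrix, one row per paper
    print("papers with more than one label:", (Y.sum(axis=1) > 1).sum())
    ```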

    Acknowledgements

    This dataset is from an Analytics Vidhya hackathon.

    Inspiration

    Can you solve it and achieve the best score?

  6. Data from: IoT-23: A labeled dataset with malicious and benign IoT network...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 3, 2021
    + more versions
    Cite
    Agustin Parmisano (2021). IoT-23: A labeled dataset with malicious and benign IoT network traffic [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4743745
    Dataset updated: Sep 3, 2021
    Dataset provided by: Sebastian Garcia, Agustin Parmisano, Maria Jose Erquiaga
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    IoT-23 is a dataset of network traffic from Internet of Things (IoT) devices. It contains 20 malware captures executed on IoT devices and 3 captures of benign IoT device traffic. It was first published in January 2020, with captures ranging from 2018 to 2019. This IoT network traffic was captured in the Stratosphere Laboratory, AIC group, FEL, CTU University, Czech Republic. Its goal is to offer a large dataset of real and labeled IoT malware infections and IoT benign traffic for researchers to develop machine learning algorithms. This dataset and its research were funded by Avast Software. The malware was allowed to connect to the Internet.

  7. Wine Labels Dataset

    • universe.roboflow.com
    zip
    Updated May 7, 2023
    Cite
    Roboflow 100 (2023). Wine Labels Dataset [Dataset]. https://universe.roboflow.com/roboflow-100/wine-labels
    Available download formats: zip
    Dataset updated: May 7, 2023
    Dataset provided by: Roboflow
    Authors: Roboflow 100
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
    Variables measured: Wine Labels Bounding Boxes
    Description

    This dataset was originally created by Yilong Zheng. To see the current project, which may have been updated since this version, please go here: https://universe.roboflow.com/wine-label/wine-label-detection.

    This dataset is part of RF100, an Intel-sponsored initiative to create a new object detection benchmark for model generalizability.

    Access the RF100 Github repo: https://github.com/roboflow-ai/roboflow-100-benchmark

  8. MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE...

    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • data.nasa.gov
    • +1more
    Updated Feb 19, 2025
    + more versions
    Cite
    nasa.gov (2025). MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/multi-label-asrs-dataset-classification-using-semi-supervised-subspace-clustering
    Dataset updated: Feb 19, 2025
    Dataset provided by: NASA (http://nasa.gov/)
    Description

    MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING

    Mohammad Salim Ahmed, Latifur Khan, Nikunj Oza, and Mandava Rajeswari

    Abstract. There has been a lot of research targeting text classification. Much of it focuses on a particular characteristic of text data: multi-labelity. This arises from the fact that a document may be associated with multiple classes at the same time. The consequence of this characteristic is the low performance of traditional binary or multi-class classification techniques on multi-label text data. In this paper, we propose a text classification technique that takes this characteristic into account and provides very good performance. Our multi-label text classification approach is an extension of our previously formulated [3] multi-class text classification approach called SISC (Semi-supervised Impurity based Subspace Clustering). We call this new classification model SISC-ML (SISC Multi-Label). Empirical evaluation on the real-world multi-label NASA ASRS (Aviation Safety Reporting System) data set reveals that our approach outperforms state-of-the-art text classification as well as subspace clustering algorithms.

  9. Quantitative Content Analysis Data for Hand Labeling Road Surface Conditions...

    • zenodo-rdm.web.cern.ch
    • data.niaid.nih.gov
    zip
    Updated Sep 27, 2023
    Cite
    Carly Sutter; Kara Sulia; Nick P. Bassill; Christopher D. Thorncroft; Christopher D. Wirz; Vanessa Przybylo; Mariana G. Cains; Jacob Radford; David Aaron Evans (2023). Quantitative Content Analysis Data for Hand Labeling Road Surface Conditions in New York State Department of Transportation Camera Images [Dataset]. http://doi.org/10.5281/zenodo.8370665
    Available download formats: zip
    Dataset updated: Sep 27, 2023
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Carly Sutter; Kara Sulia; Nick P. Bassill; Christopher D. Thorncroft; Christopher D. Wirz; Vanessa Przybylo; Mariana G. Cains; Jacob Radford; David Aaron Evans
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
    Area covered: New York
    Description

    Traffic camera images from the New York State Department of Transportation (511ny.org) are used to create a hand-labeled dataset of images classified into one of six road surface conditions: 1) severe snow, 2) snow, 3) wet, 4) dry, 5) poor visibility, or 6) obstructed. Six labelers (authors Sutter, Wirz, Przybylo, Cains, Radford, and Evans) went through a series of four labeling trials in which reliability across all six labelers was assessed using the Krippendorff's alpha (KA) metric (Krippendorff, 2007). The online tool by Dr. Freelon (Freelon, 2013; Freelon, 2010) was used to calculate reliability metrics after each trial, and the group achieved inter-coder reliability with a KA of 0.888 on the 4th trial. This process is known as quantitative content analysis, and three pieces of data used in this process are shared: 1) a PDF of the codebook, which serves as the set of rules for labeling images, 2) images from each of the four labeling trials, including the use of New York State Mesonet weather observation data (Brotzge et al., 2020), and 3) an Excel spreadsheet including the calculated inter-coder reliability metrics and other summaries used to assess reliability after each trial.
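    For reference, inter-coder reliability of the kind reported above can be computed with the krippendorff Python package; the ratings below are made up purely to show the call, not the authors' trial data:

    ```python
    import numpy as np
    import krippendorff

    # Rows = labelers, columns = images; values are category codes 1-6
    # (1 severe snow, 2 snow, 3 wet, 4 dry, 5 poor visibility, 6 obstructed).
    # np.nan marks an image a labeler did not rate.
    ratings = np.array([
        [1, 3, 4, 4, 2, 6],
        [1, 3, 4, 4, 2, 6],
        [1, 2, 4, 4, 2, np.nan],
    ])

    alpha = krippendorff.alpha(reliability_data=ratings,
                               level_of_measurement="nominal")
    print(f"Krippendorff's alpha: {alpha:.3f}")
    ```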

    The broader purpose of this work is that the six human labelers, after achieving inter-coder reliability, can then label large sets of images independently, each contributing to the creation of larger labeled dataset used for training supervised machine learning models to predict road surface conditions from camera images. The xCITE lab (xCITE, 2023) is used to store camera images from 511ny.org, and the lab provides computing resources for training machine learning models.

  10. Git labeled dataset of image diagrams

    • figshare.com
    txt
    Updated Sep 13, 2022
    Cite
    Sergio Andres Rodriguez Torres (2022). Git labeled dataset of image diagrams [Dataset]. http://doi.org/10.6084/m9.figshare.20400999.v2
    Available download formats: txt
    Dataset updated: Sep 13, 2022
    Dataset provided by: figshare
    Authors: Sergio Andres Rodriguez Torres
    License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0.html

    Description

    Dataset of 3,960,877 images built from GitHub public repositories. This dataset contains a column produced by the automatic classification of a machine learning convolutional network, with 6 possible categories related to software diagrams:

    Label  Name
    0      None
    1      Activity Diagram
    2      Sequence Diagram
    3      Class Diagram
    4      Component Diagram
    5      Use Case Diagram
    6      Cloud Diagram

    It also includes information on the repository from which each image was extracted.

  11. TagX Data Annotation | Automated Annotation | AI-assisted labeling with...

    • datarade.ai
    Updated Aug 14, 2022
    Cite
    TagX (2022). TagX Data Annotation | Automated Annotation | AI-assisted labeling with human verification | Customized annotation | Data for AI & LLMs [Dataset]. https://datarade.ai/data-products/data-annotation-services-for-artificial-intelligence-and-data-tagx
    Available download formats: .json, .xml, .csv, .xls, .txt
    Dataset updated: Aug 14, 2022
    Dataset authored and provided by: TagX
    Area covered: Sint Eustatius and Saba, Saint Barthélemy, Egypt, Estonia, Lesotho, Central African Republic, Comoros, Guatemala, Georgia, Cabo Verde
    Description

    TagX data annotation services are a set of tools and processes used to accurately label and classify large amounts of data for use in machine learning and artificial intelligence applications. The services are designed to be highly accurate, efficient, and customizable, allowing for a wide range of data types and use cases.

    The process typically begins with a team of trained annotators reviewing and categorizing the data, using a variety of annotation tools and techniques, such as text classification, image annotation, and video annotation. The annotators may also use natural language processing and other advanced techniques to extract relevant information and context from the data.

    Once the data has been annotated, it is then validated and checked for accuracy by a team of quality assurance specialists. Any errors or inconsistencies are corrected, and the data is then prepared for use in machine learning and AI models.

    TagX annotation services can be applied to a wide range of data types, including text, images, videos, and audio. The services can be customized to meet the specific needs of each client, including the type of data, the level of annotation required, and the desired level of accuracy.

    TagX data annotation services provide a powerful and efficient way to prepare large amounts of data for use in machine learning and AI applications, allowing organizations to extract valuable insights and improve their decision-making processes.

  12. Label Data Dataset

    • universe.roboflow.com
    zip
    Updated Mar 5, 2025
    + more versions
    Cite
    KHKT 20252 (2025). Label Data Dataset [Dataset]. https://universe.roboflow.com/khkt-20252/label-data-khcp8/dataset/1
    Available download formats: zip
    Dataset updated: Mar 5, 2025
    Dataset authored and provided by: KHKT 20252
    Variables measured: Objects Polygons
    Description

    Label Data

    ## Overview
    
    Label Data is a dataset for instance segmentation tasks - it contains Objects annotations for 1,955 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
  13. Rail Line Labeling Dataset

    • hr.shaip.com
    • maadaa.ai
    • +69more
    json
    Updated Dec 25, 2024
    Cite
    Shaip (2024). Rail Line Labeling Dataset [Dataset]. https://hr.shaip.com/offerings/machine-industry-datasets/
    Available download formats: json
    Dataset updated: Dec 25, 2024
    Dataset authored and provided by: Shaip
    License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/ (license information was derived automatically)

    Description

    The Rail Line Labeling Dataset is tailored for industrial applications, featuring a collection of internet-collected images with a resolution of 1920 x 1080 pixels. This dataset specializes in the detailed labeling of rail lines, including their turns and merges, using polygon annotations. Additionally, trains within these images are labeled with bounding boxes. The dataset specifically focuses on rail networks collected from Wuhan, providing a localized context for rail line analysis and train detection.

  14. Data from: label-files

    • huggingface.co
    Updated Dec 23, 2021
    Cite
    label-files [Dataset]. https://huggingface.co/datasets/huggingface/label-files
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated: Dec 23, 2021
    Dataset authored and provided by: Hugging Face (https://huggingface.co/)
    Description

    This repository contains the mapping from integer id's to actual label names (in HuggingFace Transformers typically called id2label) for several datasets. Current datasets include:

    ImageNet-1k
    ImageNet-22k (also called ImageNet-21k, as there are 21,843 classes)
    COCO detection 2017
    COCO panoptic 2017
    ADE20k (actually the MIT Scene Parsing benchmark, which is a subset of ADE20k)
    Cityscapes
    VQAv2
    Kinetics-700
    RVL-CDIP
    PASCAL VOC
    Kinetics-400
    ...

    You can read in a label file as follows (using… See the full description on the dataset page: https://huggingface.co/datasets/huggingface/label-files.
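    For illustration, one common way to read such an id2label file is via the Hugging Face Hub client; this is a sketch of the usual pattern, and the exact file name (here imagenet-1k-id2label.json) should be checked against the repository listing:

    ```python
    import json
    from huggingface_hub import hf_hub_download

    # File name assumed; check the repository for the exact name.
    path = hf_hub_download(repo_id="huggingface/label-files",
                           filename="imagenet-1k-id2label.json",
                           repo_type="dataset")

    with open(path) as f:
        id2label = {int(k): v for k, v in json.load(f).items()}

    print(id2label[0])  # first ImageNet-1k class name
    ```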

  15. Dollar street 10 - 64x64x3

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Apr 14, 2024
    + more versions
    Cite
    Sven van der burg (2024). Dollar street 10 - 64x64x3 [Dataset]. http://doi.org/10.5281/zenodo.10970014
    Available download formats: bin
    Dataset updated: Apr 14, 2024
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Sven van der burg
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    The MLCommons Dollar Street Dataset is a collection of images of everyday household items from homes around the world that visually captures socioeconomic diversity of traditionally underrepresented populations. It consists of public domain data, licensed for academic, commercial and non-commercial usage, under CC-BY and CC-BY-SA 4.0. The dataset was developed because similar datasets lack socioeconomic metadata and are not representative of global diversity.

    This is a subset of the original dataset that can be used for multiclass classification with 10 categories. It is designed to be used in teaching, similar to the widely used, but unlicensed CIFAR-10 dataset.

    These are the preprocessing steps that were performed:

    1. Only take examples with one imagenet_synonym label
    2. Use only examples with the 10 most frequently occurring labels
    3. Downscale images to 64 x 64 pixels
    4. Split data in train and test
    5. Store as numpy array

    This is the label mapping:

    Category         Label
    day bed          0
    dishrag          1
    plate            2
    running shoe     3
    soap dispenser   4
    street sign      5
    table lamp       6
    tile roof        7
    toilet seat      8
    washing machine  9

    Check out this notebook to see how the subset was created.

    The original dataset was downloaded from https://www.kaggle.com/datasets/mlcommons/the-dollar-street-dataset. See https://mlcommons.org/datasets/dollar-street/ for more information.
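    A loading sketch for the preprocessed arrays might look like the following; the archive name and array keys are assumptions, so check the Zenodo record and the linked notebook for the actual ones:

    ```python
    import numpy as np

    # Assumed file name and keys, for illustration only.
    data = np.load("dollar_street_10_64x64x3.npz")
    x_train, y_train = data["x_train"], data["y_train"]
    x_test, y_test = data["x_test"], data["y_test"]

    print(x_train.shape)       # expected (n_train, 64, 64, 3)
    print(np.unique(y_train))  # integer labels 0-9 per the mapping above
    ```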

  16. Labeling Gun Dataset

    • universe.roboflow.com
    zip
    Updated Jan 12, 2023
    Cite
    training data knife (2023). Labeling Gun Dataset [Dataset]. https://universe.roboflow.com/training-data-knife/labeling-gun
    Available download formats: zip
    Dataset updated: Jan 12, 2023
    Dataset authored and provided by: training data knife
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
    Variables measured: Gun Bounding Boxes
    Description

    Labeling Gun

    ## Overview
    
    Labeling Gun is a dataset for object detection tasks - it contains Gun annotations for 388 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  17. Kyoushi Log Data Set

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 18, 2023
    Cite
    Frank, Maximilian (2023). Kyoushi Log Data Set [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5779410
    Dataset updated: Oct 18, 2023
    Dataset provided by: Skopik, Florian; Frank, Maximilian; Rauber, Andreas; Landauer, Max; Hotwagner, Wolfgang; Wurzenberger, Markus
    License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/ (license information was derived automatically)

    Description

    This repository contains synthetic log data suitable for evaluation of intrusion detection systems. The logs were collected from a testbed that was built at the Austrian Institute of Technology (AIT) following the approaches by [1], [2], and [3]. Please refer to these papers for more detailed information on the dataset and cite them if the data is used for academic publications. Unlike the related AIT-LDSv1.1, this dataset involves a more complex network structure, makes use of a different attack scenario, and collects log data from multiple hosts in the network. In brief, the testbed simulates a small enterprise network including a mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise. After some days, two attack scenarios are launched against the network. Note that the AIT-LDSv2.0 extends this dataset with additional attack cases and variations of attack parameters.

    The archives have the following structure. The gather directory contains the raw log data from each host in the network, as well as their system configurations. The labels directory contains the ground truth for those log files that are labeled. The processing directory contains configurations for the labeling procedure and the rules directory contains the labeling rules. Labeling of events that are related to the attacks is carried out with the Kyoushi Labeling Framework.

    Each dataset contains traces of a specific attack scenario:

    Scenario 1 (see gather/attacker_0/logs/sm.log for detailed attack log):

    nmap scan

    WPScan

    dirb scan

    webshell upload through wpDiscuz exploit (CVE-2020-24186)

    privilege escalation

    Scenario 2 (see gather/attacker_0/logs/dnsteal.log for detailed attack log):

    DNSteal data exfiltration

    The log data collected from the servers includes

    Apache access and error logs (labeled)

    audit logs (labeled)

    auth logs (labeled)

    VPN logs (labeled)

    DNS logs (labeled)

    syslog

    suricata logs

    exim logs

    horde logs

    mail logs

    Note that only log files from affected servers are labeled. Label files and the directories in which they are located have the same name as their corresponding log file in the gather directory. Labels are in JSON format and comprise the following attributes: line (number of line in corresponding log file), labels (list of labels assigned to that log line), rules (names of labeling rules matching that log line). Note that not all attack traces are labeled in all log files; please refer to the labeling rules in case that some labels are not clear.
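    Assuming one JSON object per line in a label file (an assumption about the serialization; the attribute names line, labels, and rules are taken from the description above), pairing labels with their log lines might look like this:

    ```python
    import json

    # Hypothetical example paths; label files mirror the layout under gather/.
    log_path = "gather/intranet_server/logs/apache2/access.log"
    label_path = "labels/intranet_server/logs/apache2/access.log"

    with open(log_path) as f:
        log_lines = f.readlines()

    with open(label_path) as f:
        for raw in f:
            entry = json.loads(raw)      # attributes: line, labels, rules
            line_no = entry["line"]      # assumed 1-based line number in the log file
            print(line_no, entry["labels"], entry["rules"],
                  log_lines[line_no - 1].rstrip())
    ```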

    Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU project GUARD (833456).

    If you use the dataset, please cite the following publications:

    [1] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317.

    [2] M. Landauer, M. Frank, F. Skopik, W. Hotwagner, M. Wurzenberger, and A. Rauber, "A Framework for Automatic Labeling of Log Datasets from Model-driven Testbeds for HIDS Evaluation". ACM Workshop on Secure and Trustworthy Cyber-Physical Systems (ACM SaT-CPS 2022), April 27, 2022, Baltimore, MD, USA. ACM.

    [3] M. Frank, "Quality improvement of labels for model-driven benchmark data generation for intrusion detection systems", Master's Thesis, Vienna University of Technology, 2021.

  18. Human Faces and Objects Mix Image Dataset

    • data.mendeley.com
    Updated Mar 13, 2025
    Cite
    Bindu Garg (2025). Human Faces and Objects Mix Image Dataset [Dataset]. http://doi.org/10.17632/nzwvnrmwp3.1
    Dataset updated: Mar 13, 2025
    Authors: Bindu Garg
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    Dataset Description: Human Faces and Objects Dataset (HFO-5000) The Human Faces and Objects Dataset (HFO-5000) is a curated collection of 5,000 images, categorized into three distinct classes: male faces (1,500), female faces (1,500), and objects (2,000). This dataset is designed for machine learning and computer vision applications, including image classification, face detection, and object recognition. The dataset provides high-quality, labeled images with a structured CSV file for seamless integration into deep learning pipelines.

    Column Description: The dataset is accompanied by a CSV file that contains essential metadata for each image. The CSV file includes the following columns: file_name: The name of the image file (e.g., image_001.jpg). label: The category of the image, with three possible values: "male" (for male face images) "female" (for female face images) "object" (for images of various objects) file_path: The full or relative path to the image file within the dataset directory.

    Uniqueness and Key Features: 1) Balanced Distribution: The dataset maintains an even distribution of human faces (male and female) to minimize bias in classification tasks. 2) Diverse Object Selection: The object category consists of a wide variety of items, ensuring robustness in distinguishing between human and non-human entities. 3) High-Quality Images: The dataset consists of clear and well-defined images, suitable for both training and testing AI models. 4) Structured Annotations: The CSV file simplifies dataset management and integration into machine learning workflows. 5) Potential Use Cases: This dataset can be used for tasks such as gender classification, facial recognition benchmarking, human-object differentiation, and transfer learning applications.

    Conclusion: The HFO-5000 dataset provides a well-structured, diverse, and high-quality set of labeled images that can be used for various computer vision tasks. Its balanced distribution of human faces and objects ensures fairness in training AI models, making it a valuable resource for researchers and developers. By offering structured metadata and a wide range of images, this dataset facilitates advancements in deep learning applications related to facial recognition and object classification.

  19. Data from: MLRSNet: A Multi-label High Spatial Resolution Remote Sensing...

    • data.mendeley.com
    Updated Sep 18, 2023
    + more versions
    Cite
    Xiaoman Qi (2023). MLRSNet: A Multi-label High Spatial Resolution Remote Sensing Dataset for Semantic Scene Understanding [Dataset]. http://doi.org/10.17632/7j9bv9vwsx.4
    Dataset updated: Sep 18, 2023
    Authors: Xiaoman Qi
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    MLRSNet provides different perspectives of the world captured from satellites. That is, it is composed of high spatial resolution optical satellite images. MLRSNet contains 109,161 remote sensing images that are annotated into 46 categories, and the number of sample images in a category varies from 1,500 to 3,000. The images have a fixed size of 256×256 pixels with various pixel resolutions (~10m to 0.1m). Moreover, each image in the dataset is tagged with several of 60 predefined class labels, and the number of labels associated with each image varies from 1 to 13. The dataset can be used for multi-label based image classification, multi-label based image retrieval, and image segmentation.

    The dataset includes: 1. Images folder: 46 categories, 109,161 high spatial resolution remote sensing images. 2. Labels folder: each category has a .csv file. 3. Categories_names.xlsx: Sheet1 lists the names of the 46 categories, and Sheet2 shows the multi-labels associated with each category.

  20. Image diagram dataset

    • figshare.com
    zip
    Updated Sep 13, 2022
    Cite
    Sergio Andres Rodriguez Torres (2022). Image diagram dataset [Dataset]. http://doi.org/10.6084/m9.figshare.20399283.v2
    Available download formats: zip
    Dataset updated: Sep 13, 2022
    Dataset provided by: figshare
    Authors: Sergio Andres Rodriguez Torres
    License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0.html

    Description

    A collection of 5,981 images labeled into 6 categories related to software diagrams, with the following distribution:

    Label  Name               Number
    0      None               1010
    1      Activity Diagram   595
    2      Sequence Diagram   811
    3      Class Diagram      986
    4      Component Diagram  368
    5      Use Case Diagram   854
    6      Cloud Diagram      978

    The dataset consists of a CSV file with the labeling and a zip file with the normalized images. The images are normalized to RGB format and 224x224 pixels, ready for Keras neural networks.
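    Since the images are described as 224x224 RGB arrays ready for Keras, a loading and training sketch might look like the following; file names, CSV columns, and the array layout are assumptions for illustration:

    ```python
    import numpy as np
    import pandas as pd
    import tensorflow as tf

    # Assumed file and column names; check the figshare record for the actual ones.
    labels = pd.read_csv("labels.csv")          # e.g. columns: file_name, label (0-6)
    images = np.load("images_224x224.npy")      # assumed pre-extracted array, shape (N, 224, 224, 3)

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(224, 224, 3)),
        tf.keras.layers.Rescaling(1.0 / 255),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(7, activation="softmax"),  # 7 label values: 0-6
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(images, labels["label"].values, epochs=1, validation_split=0.1)
    ```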
