100+ datasets found
  1. Data from: Pseudo-Label Generation for Multi-Label Text Classification

    • catalog.data.gov
    • datasets.ai
    • +2more
    Updated Dec 6, 2023
    + more versions
    Cite
    Dashlink (2023). Pseudo-Label Generation for Multi-Label Text Classification [Dataset]. https://catalog.data.gov/dataset/pseudo-label-generation-for-multi-label-text-classification
    Dataset updated: Dec 6, 2023
    Dataset provided by: Dashlink
    Description

    With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. To handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. To build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This property is not unique to text data, as it can be found in non-text (e.g., numeric) data as well; however, it is most prevalent in text data. It also puts the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us perform better text classification, and under what kind of circumstances. The high and sparse dimensionality of text data has also been considered during classification. Although we propose and evaluate a text classification technique here, our main focus is on handling the multi-labelity of text data while utilizing the correlation among the multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide insight into how multi-labelity is handled in our classification process and show the effectiveness of our approach.
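    The core idea described here, generating pseudo-labels as combinations of existing class labels, can be sketched generically (a label-powerset style construction for illustration only; this is not the authors' pseudo-LSC implementation, which additionally performs subspace clustering of high-dimensional sparse text features):

    ```python
    import numpy as np

    # Toy multi-label indicator matrix: rows are documents, columns are class labels.
    Y = np.array([
        [1, 0, 1],   # doc 0: labels A and C
        [1, 0, 1],   # doc 1: labels A and C
        [0, 1, 0],   # doc 2: label B
        [1, 1, 0],   # doc 3: labels A and B
    ])

    # Each distinct observed combination of labels becomes one pseudo-label,
    # turning the multi-label problem into a single-label one.
    combos, pseudo_labels = np.unique(Y, axis=0, return_inverse=True)

    print(pseudo_labels)              # [1 1 0 2]: one pseudo-label id per document
    for i, combo in enumerate(combos):
        print(f"pseudo-label {i} = label combination {combo}")
    ```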

  2. Data from: Towards Automatic Labeling of Exception Handling Bugs: A Case...

    • figshare.com
    zip
    Updated Apr 29, 2024
    Cite
    Renan Vieira (2024). Towards Automatic Labeling of Exception Handling Bugs: A Case Study of 10 Years Bug-Fixing in Apache Hadoop [Dataset]. http://doi.org/10.6084/m9.figshare.22735124.v2
    Available download formats: zip
    Dataset updated: Apr 29, 2024
    Dataset provided by: figshare
    Authors: Renan Vieira
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    Context: Exception handling (EH) bugs stem from incorrect usage of exception handling mechanisms (EHMs) and often incur severe consequences (e.g., system downtime, data loss, and security risk). Tracking EH bugs is particularly relevant for contemporary systems (e.g., cloud- and AI-based systems), in which the software's sophisticated logic is an additional threat to the correct use of the EHM. On top of that, bug reporters can seldom tag EH bugs, since doing so may require encompassing knowledge of the software's EH strategy. Surprisingly, to the best of our knowledge, there is no automated procedure to identify EH bugs from report descriptions.

    Objective: First, we aim to evaluate the extent to which Natural Language Processing (NLP) and Machine Learning (ML) can be used to reliably label EH bugs using the text fields from bug reports (e.g., summary, description, and comments). Second, we aim to provide a reliably labeled dataset that the community can use in future endeavors. Overall, we expect our work to raise the community's awareness regarding the importance of EH bugs.

    Method: We manually analyzed 4,516 bug reports from the four main components of Apache's Hadoop project, out of which we labeled ~20% (943) as EH bugs. We also labeled 2,584 non-EH bugs by analyzing their bug-fixing code, creating a dataset composed of 7,100 bug reports. Then, we used word embedding techniques (Bag-of-Words and TF-IDF) to summarize the textual fields of bug reports. Subsequently, we used these embeddings to fit five classes of ML methods and evaluated them on unseen data. We also evaluated a pre-trained transformer-based model using the complete textual fields. We further evaluated whether considering only EH keywords is enough to achieve high predictive performance.

    Results: Our results show that using a pre-trained DistilBERT with a linear layer trained with our proposed dataset can reasonably label EH bugs, achieving ROC-AUC scores of up to 0.88. The combination of traditional NLP and ML techniques achieved ROC-AUC scores of up to 0.74 and recall up to 0.56. As a sanity check, we also evaluated methods using embeddings extracted solely from keywords. Considering ROC-AUC as the primary concern, for the majority of ML methods tested, the analysis suggests that keywords alone are not sufficient to characterize reports of EH bugs, although this can change based on other metrics (such as recall and precision) or ML methods (e.g., Random Forest).

    Conclusions: To the best of our knowledge, this is the first study addressing the problem of automatic labeling of EH bugs. Based on our results, we conclude that the use of ML techniques, especially transformer-based models, sounds promising for automating the task of labeling EH bugs. Overall, we hope (i) that our work will contribute towards raising awareness around EH bugs; and (ii) that our (publicly available) dataset will serve as a benchmarking dataset, paving the way for follow-up works. Additionally, our findings can be used to build tools that help maintainers flesh out EH bugs during the triage process.
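    As a rough illustration of the classical NLP + ML pipeline described above (TF-IDF features fed to a standard classifier and scored with ROC-AUC), here is a minimal sketch; the file name, column names, and classifier choice are assumptions for illustration, not the authors' exact setup:

    ```python
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Hypothetical columns: 'text' (summary + description + comments) and 'is_eh_bug' (0/1).
    df = pd.read_csv("hadoop_bug_reports.csv")  # assumed file name
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["is_eh_bug"], test_size=0.2,
        stratify=df["is_eh_bug"], random_state=42)

    vec = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(X_train), y_train)

    probs = clf.predict_proba(vec.transform(X_test))[:, 1]
    print("ROC-AUC:", roc_auc_score(y_test, probs))
    ```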

  3. Data Labeling Solution and Services Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 7, 2025
    + more versions
    Cite
    AMA Research & Media LLP (2025). Data Labeling Solution and Services Report [Dataset]. https://www.archivemarketresearch.com/reports/data-labeling-solution-and-services-52811
    Available download formats: doc, ppt, pdf
    Dataset updated: Mar 7, 2025
    Dataset provided by: AMA Research & Media LLP
    License: https://www.archivemarketresearch.com/privacy-policy
    Time period covered: 2025 - 2033
    Area covered: Global
    Variables measured: Market Size
    Description

    The Data Labeling Solutions and Services market is experiencing robust growth, driven by the escalating demand for high-quality training data in the artificial intelligence (AI) and machine learning (ML) sectors. The market, estimated at $15 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching approximately $75 billion by 2033. This expansion is fueled by several key factors. Firstly, the increasing adoption of AI across diverse industries, including automotive, healthcare, and finance, necessitates vast amounts of accurately labeled data for model training and improvement. Secondly, advancements in deep learning algorithms and the emergence of sophisticated data annotation tools are streamlining the labeling process, boosting efficiency and reducing costs. Finally, the growing availability of diverse data sources, coupled with the rise of specialized data labeling companies, is further contributing to market growth. Despite these positive trends, the market faces some challenges. The high cost associated with data annotation, particularly for complex datasets requiring specialized expertise, can be a barrier for smaller businesses. Ensuring data quality and consistency across large-scale projects remains a critical concern, necessitating robust quality control measures. Furthermore, addressing data privacy and security issues is essential to maintain ethical standards and build trust within the market. The market segmentation by type (text, image/video, audio) and application (automotive, government, healthcare, financial services, etc.) presents significant opportunities for specialized service providers catering to niche needs. Competition is expected to intensify as new players enter the market, focusing on innovative solutions and specialized services.

  4. Lego Part Label Classification Dataset

    • universe.roboflow.com
    zip
    Updated Apr 25, 2023
    Cite
    PDP lego classification (2023). Lego Part Label Classification Dataset [Dataset]. https://universe.roboflow.com/pdp-lego-classification/lego-part-label-classification
    Available download formats: zip
    Dataset updated: Apr 25, 2023
    Dataset provided by: The Lego Group (http://lego.com/)
    Authors: PDP lego classification
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
    Variables measured: Lego
    Description

    Lego Part Label Classification

    ## Overview
    
    Lego Part Label Classification is a dataset for classification tasks - it contains Lego annotations for 301 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  5. Multi-Label Classification Dataset

    • kaggle.com
    Updated Jan 28, 2021
    Cite
    Shivanand (2021). Multi-Label Classification Dataset [Dataset]. https://www.kaggle.com/datasets/shivanandmn/multilabel-classification-dataset/data
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated: Jan 28, 2021
    Dataset provided by: Kaggle (http://kaggle.com/)
    Authors: Shivanand
    Description

    Context

    NLP: Multi-label Classification Dataset.

    Content

    The dataset contains 6 different labels (Computer Science, Physics, Mathematics, Statistics, Quantitative Biology, Quantitative Finance) used to classify research papers based on their abstract and title. A value of 1 in a label column indicates that the label applies to that paper; a paper can have several labels set to 1.
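    A minimal sketch of how such a multi-hot label layout might be read (file and column names are assumed from the description above and may differ in the actual Kaggle files):

    ```python
    import pandas as pd

    df = pd.read_csv("train.csv")  # assumed file name
    label_cols = ["Computer Science", "Physics", "Mathematics",
                  "Statistics", "Quantitative Biology", "Quantitative Finance"]

    texts = df["TITLE"] + ". " + df["ABSTRACT"]   # assumed text columns
    Y = df[label_cols].values                     # multi-hot matrix, one row per paper
    print("papers with more than one label:", (Y.sum(axis=1) > 1).sum())
    ```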

    Acknowledgements

    This dataset is from an Analytics Vidhya hackathon.

    Inspiration

    Can you solve it and achieve the best score?

  6. Data from: IoT-23: A labeled dataset with malicious and benign IoT network...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 3, 2021
    + more versions
    Cite
    Agustin Parmisano (2021). IoT-23: A labeled dataset with malicious and benign IoT network traffic [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4743745
    Dataset updated: Sep 3, 2021
    Dataset provided by: Sebastian Garcia, Agustin Parmisano, Maria Jose Erquiaga
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    IoT-23 is a dataset of network traffic from Internet of Things (IoT) devices. It contains 20 malware captures executed on IoT devices and 3 captures of benign IoT device traffic. It was first published in January 2020, with captures ranging from 2018 to 2019. This IoT network traffic was captured in the Stratosphere Laboratory, AIC group, FEL, CTU University, Czech Republic. Its goal is to offer a large dataset of real and labeled IoT malware infections and IoT benign traffic for researchers to develop machine learning algorithms. This dataset and its research were funded by Avast Software. The malware was allowed to connect to the Internet.

  7. Wine Labels Dataset

    • universe.roboflow.com
    zip
    Updated May 7, 2023
    Cite
    Roboflow 100 (2023). Wine Labels Dataset [Dataset]. https://universe.roboflow.com/roboflow-100/wine-labels
    Available download formats: zip
    Dataset updated: May 7, 2023
    Dataset provided by: Roboflow
    Authors: Roboflow 100
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
    Variables measured: Wine Labels Bounding Boxes
    Description

    This dataset was originally created by Yilong Zheng. To see the current project, which may have been updated since this version, please go here: https://universe.roboflow.com/wine-label/wine-label-detection.

    This dataset is part of RF100, an Intel-sponsored initiative to create a new object detection benchmark for model generalizability.

    Access the RF100 Github repo: https://github.com/roboflow-ai/roboflow-100-benchmark

  8. MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE...

    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • data.nasa.gov
    • +1more
    Updated Feb 19, 2025
    + more versions
    Cite
    nasa.gov (2025). MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/multi-label-asrs-dataset-classification-using-semi-supervised-subspace-clustering
    Dataset updated: Feb 19, 2025
    Dataset provided by: NASA (http://nasa.gov/)
    Description

    MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING

    Mohammad Salim Ahmed, Latifur Khan, Nikunj Oza, and Mandava Rajeswari

    Abstract. There has been a lot of research targeting text classification. Much of it focuses on a particular characteristic of text data: multi-labelity. This arises from the fact that a document may be associated with multiple classes at the same time. The consequence of this characteristic is the low performance of traditional binary or multi-class classification techniques on multi-label text data. In this paper, we propose a text classification technique that takes this characteristic into account and provides very good performance. Our multi-label text classification approach is an extension of our previously formulated [3] multi-class text classification approach called SISC (Semi-supervised Impurity based Subspace Clustering). We call this new classification model SISC-ML (SISC Multi-Label). Empirical evaluation on the real-world multi-label NASA ASRS (Aviation Safety Reporting System) data set reveals that our approach outperforms state-of-the-art text classification as well as subspace clustering algorithms.

  9. Quantitative Content Analysis Data for Hand Labeling Road Surface Conditions...

    • zenodo-rdm.web.cern.ch
    • data.niaid.nih.gov
    zip
    Updated Sep 27, 2023
    Cite
    Carly Sutter; Kara Sulia; Nick P. Bassill; Christopher D. Thorncroft; Christopher D. Wirz; Vanessa Przybylo; Mariana G. Cains; Jacob Radford; David Aaron Evans (2023). Quantitative Content Analysis Data for Hand Labeling Road Surface Conditions in New York State Department of Transportation Camera Images [Dataset]. http://doi.org/10.5281/zenodo.8370665
    Available download formats: zip
    Dataset updated: Sep 27, 2023
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Carly Sutter; Kara Sulia; Nick P. Bassill; Christopher D. Thorncroft; Christopher D. Wirz; Vanessa Przybylo; Mariana G. Cains; Jacob Radford; David Aaron Evans
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
    Area covered: New York
    Description

    Traffic camera images from the New York State Department of Transportation (511ny.org) are used to create a hand-labeled dataset of images classified into one of six road surface conditions: 1) severe snow, 2) snow, 3) wet, 4) dry, 5) poor visibility, or 6) obstructed. Six labelers (authors Sutter, Wirz, Przybylo, Cains, Radford, and Evans) went through a series of four labeling trials in which reliability across all six labelers was assessed using the Krippendorff's alpha (KA) metric (Krippendorff, 2007). The online tool by Dr. Freelon (Freelon, 2013; Freelon, 2010) was used to calculate reliability metrics after each trial, and the group achieved inter-coder reliability with a KA of 0.888 on the 4th trial. This process is known as quantitative content analysis, and three pieces of data used in this process are shared: 1) a PDF of the codebook, which serves as the set of rules for labeling images, 2) images from each of the four labeling trials, including the use of New York State Mesonet weather observation data (Brotzge et al., 2020), and 3) an Excel spreadsheet including the calculated inter-coder reliability metrics and other summaries used to assess reliability after each trial.
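    For reference, inter-coder reliability of the kind reported above can be computed with the krippendorff Python package; the ratings below are made up purely to show the call, not the authors' trial data:

    ```python
    import numpy as np
    import krippendorff

    # Rows = labelers, columns = images; values are category codes 1-6
    # (1 severe snow, 2 snow, 3 wet, 4 dry, 5 poor visibility, 6 obstructed).
    # np.nan marks an image a labeler did not rate.
    ratings = np.array([
        [1, 3, 4, 4, 2, 6],
        [1, 3, 4, 4, 2, 6],
        [1, 2, 4, 4, 2, np.nan],
    ])

    alpha = krippendorff.alpha(reliability_data=ratings,
                               level_of_measurement="nominal")
    print(f"Krippendorff's alpha: {alpha:.3f}")
    ```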

    The broader purpose of this work is that the six human labelers, after achieving inter-coder reliability, can then label large sets of images independently, each contributing to the creation of larger labeled dataset used for training supervised machine learning models to predict road surface conditions from camera images. The xCITE lab (xCITE, 2023) is used to store camera images from 511ny.org, and the lab provides computing resources for training machine learning models.

  10. Git labeled dataset of image diagrams

    • figshare.com
    txt
    Updated Sep 13, 2022
    Cite
    Sergio Andres Rodriguez Torres (2022). Git labeled dataset of image diagrams [Dataset]. http://doi.org/10.6084/m9.figshare.20400999.v2
    Available download formats: txt
    Dataset updated: Sep 13, 2022
    Dataset provided by: figshare
    Authors: Sergio Andres Rodriguez Torres
    License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0.html

    Description

    Dataset of 3,960,877 images built from GitHub public repositories. This dataset contains a column produced by the automatic classification of a machine learning convolutional network, with 6 possible categories related to software diagrams:

    Label  Name
    0      None
    1      Activity Diagram
    2      Sequence Diagram
    3      Class Diagram
    4      Component Diagram
    5      Use Case Diagram
    6      Cloud Diagram

    It also includes information on the repository from which each image was extracted.

  11. TagX Data Annotation | Automated Annotation | AI-assisted labeling with...

    • datarade.ai
    Updated Aug 14, 2022
    Cite
    TagX (2022). TagX Data Annotation | Automated Annotation | AI-assisted labeling with human verification | Customized annotation | Data for AI & LLMs [Dataset]. https://datarade.ai/data-products/data-annotation-services-for-artificial-intelligence-and-data-tagx
    Available download formats: .json, .xml, .csv, .xls, .txt
    Dataset updated: Aug 14, 2022
    Dataset authored and provided by: TagX
    Area covered: Sint Eustatius and Saba, Saint Barthélemy, Egypt, Estonia, Lesotho, Central African Republic, Comoros, Guatemala, Georgia, Cabo Verde
    Description

    TagX data annotation services are a set of tools and processes used to accurately label and classify large amounts of data for use in machine learning and artificial intelligence applications. The services are designed to be highly accurate, efficient, and customizable, allowing for a wide range of data types and use cases.

    The process typically begins with a team of trained annotators reviewing and categorizing the data, using a variety of annotation tools and techniques, such as text classification, image annotation, and video annotation. The annotators may also use natural language processing and other advanced techniques to extract relevant information and context from the data.

    Once the data has been annotated, it is then validated and checked for accuracy by a team of quality assurance specialists. Any errors or inconsistencies are corrected, and the data is then prepared for use in machine learning and AI models.

    TagX annotation services can be applied to a wide range of data types, including text, images, videos, and audio. The services can be customized to meet the specific needs of each client, including the type of data, the level of annotation required, and the desired level of accuracy.

    TagX data annotation services provide a powerful and efficient way to prepare large amounts of data for use in machine learning and AI applications, allowing organizations to extract valuable insights and improve their decision-making processes.

  12. Label Data Dataset

    • universe.roboflow.com
    zip
    Updated Mar 5, 2025
    + more versions
    Cite
    KHKT 20252 (2025). Label Data Dataset [Dataset]. https://universe.roboflow.com/khkt-20252/label-data-khcp8/dataset/1
    Available download formats: zip
    Dataset updated: Mar 5, 2025
    Dataset authored and provided by: KHKT 20252
    Variables measured: Objects Polygons
    Description

    Label Data

    ## Overview
    
    Label Data is a dataset for instance segmentation tasks - it contains Objects annotations for 1,955 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
  13. Rail Line Labeling Dataset

    • hr.shaip.com
    • maadaa.ai
    • +69more
    json
    Updated Dec 25, 2024
    Cite
    Shaip (2024). Rail Line Labeling Dataset [Dataset]. https://hr.shaip.com/offerings/machine-industry-datasets/
    Available download formats: json
    Dataset updated: Dec 25, 2024
    Dataset authored and provided by: Shaip
    License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/ (license information was derived automatically)

    Description

    The Rail Line Labeling Dataset is tailored for industrial applications, featuring a collection of internet-collected images with a resolution of 1920 x 1080 pixels. This dataset specializes in the detailed labeling of rail lines, including their turns and merges, using polygon annotations. Additionally, trains within these images are labeled with bounding boxes. The dataset specifically focuses on rail networks collected from Wuhan, providing a localized context for rail line analysis and train detection.

  14. Data from: label-files

    • huggingface.co
    Updated Dec 23, 2021
    Cite
    label-files [Dataset]. https://huggingface.co/datasets/huggingface/label-files
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated: Dec 23, 2021
    Dataset authored and provided by: Hugging Face (https://huggingface.co/)
    Description

    This repository contains the mapping from integer id's to actual label names (in HuggingFace Transformers typically called id2label) for several datasets. Current datasets include:

    ImageNet-1k
    ImageNet-22k (also called ImageNet-21k, as there are 21,843 classes)
    COCO detection 2017
    COCO panoptic 2017
    ADE20k (actually the MIT Scene Parsing benchmark, which is a subset of ADE20k)
    Cityscapes
    VQAv2
    Kinetics-700
    RVL-CDIP
    PASCAL VOC
    Kinetics-400
    ...

    You can read in a label file as follows (using… See the full description on the dataset page: https://huggingface.co/datasets/huggingface/label-files.
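    For illustration, one common way to read such an id2label file is via the Hugging Face Hub client; this is a sketch of the usual pattern, and the exact file name (here imagenet-1k-id2label.json) should be checked against the repository listing:

    ```python
    import json
    from huggingface_hub import hf_hub_download

    # File name assumed; check the repository for the exact name.
    path = hf_hub_download(repo_id="huggingface/label-files",
                           filename="imagenet-1k-id2label.json",
                           repo_type="dataset")

    with open(path) as f:
        id2label = {int(k): v for k, v in json.load(f).items()}

    print(id2label[0])  # first ImageNet-1k class name
    ```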

  15. Dollar street 10 - 64x64x3

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Apr 14, 2024
    + more versions
    Cite
    Sven van der burg (2024). Dollar street 10 - 64x64x3 [Dataset]. http://doi.org/10.5281/zenodo.10970014
    Available download formats: bin
    Dataset updated: Apr 14, 2024
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Sven van der burg
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    The MLCommons Dollar Street Dataset is a collection of images of everyday household items from homes around the world that visually captures socioeconomic diversity of traditionally underrepresented populations. It consists of public domain data, licensed for academic, commercial and non-commercial usage, under CC-BY and CC-BY-SA 4.0. The dataset was developed because similar datasets lack socioeconomic metadata and are not representative of global diversity.

    This is a subset of the original dataset that can be used for multiclass classification with 10 categories. It is designed to be used in teaching, similar to the widely used, but unlicensed CIFAR-10 dataset.

    These are the preprocessing steps that were performed:

    1. Only take examples with one imagenet_synonym label
    2. Use only examples with the 10 most frequently occurring labels
    3. Downscale images to 64 x 64 pixels
    4. Split data in train and test
    5. Store as numpy array

    This is the label mapping:

    Category         Label
    day bed          0
    dishrag          1
    plate            2
    running shoe     3
    soap dispenser   4
    street sign      5
    table lamp       6
    tile roof        7
    toilet seat      8
    washing machine  9

    Check out this notebook to see how the subset was created.

    The original dataset was downloaded from https://www.kaggle.com/datasets/mlcommons/the-dollar-street-dataset. See https://mlcommons.org/datasets/dollar-street/ for more information.
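    A loading sketch for the preprocessed arrays might look like the following; the archive name and array keys are assumptions, so check the Zenodo record and the linked notebook for the actual ones:

    ```python
    import numpy as np

    # Assumed file name and keys, for illustration only.
    data = np.load("dollar_street_10_64x64x3.npz")
    x_train, y_train = data["x_train"], data["y_train"]
    x_test, y_test = data["x_test"], data["y_test"]

    print(x_train.shape)       # expected (n_train, 64, 64, 3)
    print(np.unique(y_train))  # integer labels 0-9 per the mapping above
    ```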

  16. Labeling Gun Dataset

    • universe.roboflow.com
    zip
    Updated Jan 12, 2023
    Cite
    training data knife (2023). Labeling Gun Dataset [Dataset]. https://universe.roboflow.com/training-data-knife/labeling-gun
    Available download formats: zip
    Dataset updated: Jan 12, 2023
    Dataset authored and provided by: training data knife
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
    Variables measured: Gun Bounding Boxes
    Description

    Labeling Gun

    ## Overview
    
    Labeling Gun is a dataset for object detection tasks - it contains Gun annotations for 388 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  17. Kyoushi Log Data Set

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 18, 2023
    Cite
    Frank, Maximilian (2023). Kyoushi Log Data Set [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5779410
    Dataset updated: Oct 18, 2023
    Dataset provided by: Skopik, Florian; Frank, Maximilian; Rauber, Andreas; Landauer, Max; Hotwagner, Wolfgang; Wurzenberger, Markus
    License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/ (license information was derived automatically)

    Description

    This repository contains synthetic log data suitable for evaluation of intrusion detection systems. The logs were collected from a testbed that was built at the Austrian Institute of Technology (AIT) following the approaches by [1], [2], and [3]. Please refer to these papers for more detailed information on the dataset and cite them if the data is used for academic publications. Unlike the related AIT-LDSv1.1, this dataset involves a more complex network structure, makes use of a different attack scenario, and collects log data from multiple hosts in the network. In brief, the testbed simulates a small enterprise network including a mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise. After some days, two attack scenarios are launched against the network. Note that the AIT-LDSv2.0 extends this dataset with additional attack cases and variations of attack parameters.

    The archives have the following structure. The gather directory contains the raw log data from each host in the network, as well as their system configurations. The labels directory contains the ground truth for those log files that are labeled. The processing directory contains configurations for the labeling procedure and the rules directory contains the labeling rules. Labeling of events that are related to the attacks is carried out with the Kyoushi Labeling Framework.

    Each dataset contains traces of a specific attack scenario:

    Scenario 1 (see gather/attacker_0/logs/sm.log for detailed attack log):

    nmap scan

    WPScan

    dirb scan

    webshell upload through wpDiscuz exploit (CVE-2020-24186)

    privilege escalation

    Scenario 2 (see gather/attacker_0/logs/dnsteal.log for detailed attack log):

    DNSteal data exfiltration

    The log data collected from the servers includes

    Apache access and error logs (labeled)

    audit logs (labeled)

    auth logs (labeled)

    VPN logs (labeled)

    DNS logs (labeled)

    syslog

    suricata logs

    exim logs

    horde logs

    mail logs

    Note that only log files from affected servers are labeled. Label files and the directories in which they are located have the same name as their corresponding log file in the gather directory. Labels are in JSON format and comprise the following attributes: line (number of line in corresponding log file), labels (list of labels assigned to that log line), rules (names of labeling rules matching that log line). Note that not all attack traces are labeled in all log files; please refer to the labeling rules in case that some labels are not clear.
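    Assuming one JSON object per line in a label file (an assumption about the serialization; the attribute names line, labels, and rules are taken from the description above), pairing labels with their log lines might look like this:

    ```python
    import json

    # Hypothetical example paths; label files mirror the layout under gather/.
    log_path = "gather/intranet_server/logs/apache2/access.log"
    label_path = "labels/intranet_server/logs/apache2/access.log"

    with open(log_path) as f:
        log_lines = f.readlines()

    with open(label_path) as f:
        for raw in f:
            entry = json.loads(raw)      # attributes: line, labels, rules
            line_no = entry["line"]      # assumed 1-based line number in the log file
            print(line_no, entry["labels"], entry["rules"],
                  log_lines[line_no - 1].rstrip())
    ```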

    Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU project GUARD (833456).

    If you use the dataset, please cite the following publications:

    [1] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317.

    [2] M. Landauer, M. Frank, F. Skopik, W. Hotwagner, M. Wurzenberger, and A. Rauber, "A Framework for Automatic Labeling of Log Datasets from Model-driven Testbeds for HIDS Evaluation". ACM Workshop on Secure and Trustworthy Cyber-Physical Systems (ACM SaT-CPS 2022), April 27, 2022, Baltimore, MD, USA. ACM.

    [3] M. Frank, "Quality improvement of labels for model-driven benchmark data generation for intrusion detection systems", Master's Thesis, Vienna University of Technology, 2021.

  18. Human Faces and Objects Mix Image Dataset

    • data.mendeley.com
    Updated Mar 13, 2025
    Cite
    Bindu Garg (2025). Human Faces and Objects Mix Image Dataset [Dataset]. http://doi.org/10.17632/nzwvnrmwp3.1
    Dataset updated: Mar 13, 2025
    Authors: Bindu Garg
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    Dataset Description: Human Faces and Objects Dataset (HFO-5000) The Human Faces and Objects Dataset (HFO-5000) is a curated collection of 5,000 images, categorized into three distinct classes: male faces (1,500), female faces (1,500), and objects (2,000). This dataset is designed for machine learning and computer vision applications, including image classification, face detection, and object recognition. The dataset provides high-quality, labeled images with a structured CSV file for seamless integration into deep learning pipelines.

    Column Description: The dataset is accompanied by a CSV file that contains essential metadata for each image. The CSV file includes the following columns: file_name: The name of the image file (e.g., image_001.jpg). label: The category of the image, with three possible values: "male" (for male face images) "female" (for female face images) "object" (for images of various objects) file_path: The full or relative path to the image file within the dataset directory.

    Uniqueness and Key Features: 1) Balanced Distribution: The dataset maintains an even distribution of human faces (male and female) to minimize bias in classification tasks. 2) Diverse Object Selection: The object category consists of a wide variety of items, ensuring robustness in distinguishing between human and non-human entities. 3) High-Quality Images: The dataset consists of clear and well-defined images, suitable for both training and testing AI models. 4) Structured Annotations: The CSV file simplifies dataset management and integration into machine learning workflows. 5) Potential Use Cases: This dataset can be used for tasks such as gender classification, facial recognition benchmarking, human-object differentiation, and transfer learning applications.

    Conclusion: The HFO-5000 dataset provides a well-structured, diverse, and high-quality set of labeled images that can be used for various computer vision tasks. Its balanced distribution of human faces and objects ensures fairness in training AI models, making it a valuable resource for researchers and developers. By offering structured metadata and a wide range of images, this dataset facilitates advancements in deep learning applications related to facial recognition and object classification.

  19. Data from: MLRSNet: A Multi-label High Spatial Resolution Remote Sensing...

    • data.mendeley.com
    Updated Sep 18, 2023
    + more versions
    Cite
    Xiaoman Qi (2023). MLRSNet: A Multi-label High Spatial Resolution Remote Sensing Dataset for Semantic Scene Understanding [Dataset]. http://doi.org/10.17632/7j9bv9vwsx.4
    Dataset updated: Sep 18, 2023
    Authors: Xiaoman Qi
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    MLRSNet provides different perspectives of the world captured from satellites. That is, it is composed of high spatial resolution optical satellite images. MLRSNet contains 109,161 remote sensing images that are annotated into 46 categories, and the number of sample images in a category varies from 1,500 to 3,000. The images have a fixed size of 256×256 pixels with various pixel resolutions (~10m to 0.1m). Moreover, each image in the dataset is tagged with several of 60 predefined class labels, and the number of labels associated with each image varies from 1 to 13. The dataset can be used for multi-label based image classification, multi-label based image retrieval, and image segmentation.

    The dataset includes: 1. Images folder: 46 categories, 109,161 high spatial resolution remote sensing images. 2. Labels folder: each category has a .csv file. 3. Categories_names.xlsx: Sheet1 lists the names of the 46 categories, and Sheet2 shows the multi-labels associated with each category.

  20. Image diagram dataset

    • figshare.com
    zip
    Updated Sep 13, 2022
    Cite
    Sergio Andres Rodriguez Torres (2022). Image diagram dataset [Dataset]. http://doi.org/10.6084/m9.figshare.20399283.v2
    Available download formats: zip
    Dataset updated: Sep 13, 2022
    Dataset provided by: figshare
    Authors: Sergio Andres Rodriguez Torres
    License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0.html

    Description

    A collection of 5,981 images labeled into 6 categories related to software diagrams, with the following distribution:

    Label  Name               Number
    0      None               1010
    1      Activity Diagram   595
    2      Sequence Diagram   811
    3      Class Diagram      986
    4      Component Diagram  368
    5      Use Case Diagram   854
    6      Cloud Diagram      978

    The dataset consists of a CSV file with the labeling and a zip file with the normalized images. The images are normalized to RGB format and 224x224 pixels, ready for Keras neural networks.
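    Since the images are described as 224x224 RGB arrays ready for Keras, a loading and training sketch might look like the following; file names, CSV columns, and the array layout are assumptions for illustration:

    ```python
    import numpy as np
    import pandas as pd
    import tensorflow as tf

    # Assumed file and column names; check the figshare record for the actual ones.
    labels = pd.read_csv("labels.csv")          # e.g. columns: file_name, label (0-6)
    images = np.load("images_224x224.npy")      # assumed pre-extracted array, shape (N, 224, 224, 3)

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(224, 224, 3)),
        tf.keras.layers.Rescaling(1.0 / 255),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(7, activation="softmax"),  # 7 label values: 0-6
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(images, labels["label"].values, epochs=1, validation_split=0.1)
    ```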
