CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
## Overview
Bottle Labels is a dataset for object detection tasks - it contains Labels annotations for 644 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC0 1.0 Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The dataset consists of academic papers sourced from arXiv, covering a diverse range of topics such as computer science, AI, and mathematics. It is preprocessed and annotated for multi-label classification, with each paper associated with one or more subject categories. The data collection process is documented here as well. The dataset Arxiv34k6L contains abstracts and their categories; readers can download and preprocess the data to suit their own needs by following the collection step. There are two versions: the 90K version is unbalanced, whereas the 34K version is more balanced, for simplicity.
Version 2: I have added training, test, and validation splits for the 4-label problem, for further simplicity.
This dataset is part of my project on GitHub.
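As a hedged preprocessing sketch for the multi-label setup described above, the snippet below turns per-paper category lists into multi-hot vectors with scikit-learn; the column names and example rows are assumptions for illustration, not the dataset's documented schema.

```python
# Minimal sketch: turn per-paper category lists into multi-hot label vectors.
# Column names 'abstract' and 'categories' are assumed, not documented.
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    "abstract": ["We study graph neural networks...", "A new bound for..."],
    "categories": [["cs.LG", "cs.AI"], ["math.CO"]],
})

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df["categories"])  # shape: (n_papers, n_labels)
print(mlb.classes_)                      # e.g. ['cs.AI' 'cs.LG' 'math.CO']
print(y)
```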
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Product Label is a dataset for object detection tasks - it contains Products LjCv annotations for 211 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
This file contains the data elements used for searching the FDA Online Data Repository, including proprietary name, active ingredients, marketing application number or regulatory citation, National Drug Code, and company name.
Generate training data for the Follow model
This repo generates the data for Follow-Lang/set.mm.label on Hugging Face.
Format
The data is located in datasets/train. Each line is formatted as `s label arguments`. The maximum word length is 1024. All vocabulary words are listed in words.txt, based on Follow-Lang/set.mm.proof. The data was generated with a depth of 2.
If you need additional data, feel free to reach out. See the full description on the dataset page: https://huggingface.co/datasets/Follow-Lang/set.mm.label.
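As a minimal, hedged parsing sketch for the `s label arguments` line format described above: the file name is hypothetical, and the reading of the first token as the statement and the second as the label is an assumption based on the format note.

```python
# Parse lines of the form `s label arguments` (interpretation assumed).
# The file path below is hypothetical.
from pathlib import Path

def parse_line(line: str):
    tokens = line.strip().split()
    statement, label, arguments = tokens[0], tokens[1], tokens[2:]
    return statement, label, arguments

for line in Path("datasets/train/data.txt").read_text().splitlines():
    print(parse_line(line))
```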
https://choosealicense.com/licenses/cc/
## Dataset Card for wine-labels
**The original COCO dataset is stored at dataset.tar.gz**
### Dataset Summary
wine-labels
### Supported Tasks and Leaderboards
- object-detection: The dataset can be used to train a model for Object Detection.
### Languages
English
### Dataset Structure
#### Data Instances
A data point comprises an image and its object annotations. { 'image_id': 15, 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB… See the full description on the dataset page: https://huggingface.co/datasets/Francesco/wine-labels.
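As a brief, hedged loading sketch, the snippet below pulls the dataset with the Hugging Face datasets library and inspects one example; the split name 'train' is an assumption, and the field names follow the data instance shown above.

```python
# Minimal loading sketch; split name 'train' is an assumption.
from datasets import load_dataset

ds = load_dataset("Francesco/wine-labels", split="train")
example = ds[0]
print(example["image_id"], example["image"].size)  # field names from the card
```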
## Overview
Data Labeling Task is a dataset for object detection tasks - it contains Hand annotations for 5,048 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context: Exception handling (EH) bugs stem from incorrect usage of exception handling mechanisms (EHMs) and often incur severe consequences (e.g., system downtime, data loss, and security risks). Tracking EH bugs is particularly relevant for contemporary systems (e.g., cloud- and AI-based systems), in which the software's sophisticated logic is an additional threat to the correct use of the EHM. On top of that, bug reporters can seldom tag EH bugs, since doing so may require encompassing knowledge of the software's EH strategy. Surprisingly, to the best of our knowledge, there is no automated procedure to identify EH bugs from report descriptions.

Objective: First, we aim to evaluate the extent to which Natural Language Processing (NLP) and Machine Learning (ML) can be used to reliably label EH bugs using the text fields from bug reports (e.g., summary, description, and comments). Second, we aim to provide a reliably labeled dataset that the community can use in future endeavors. Overall, we expect our work to raise the community's awareness regarding the importance of EH bugs.

Method: We manually analyzed 4,516 bug reports from the four main components of Apache's Hadoop project, out of which we labeled ~20% (943) as EH bugs. We also labeled 2,584 non-EH bugs by analyzing their bug-fixing code, creating a dataset composed of 7,100 bug reports. Then, we used text representation techniques (Bag-of-Words and TF-IDF) to summarize the textual fields of bug reports. Subsequently, we used these representations to fit five classes of ML methods and evaluated them on unseen data. We also evaluated a pre-trained transformer-based model using the complete textual fields, and we assessed whether considering only EH keywords is enough to achieve high predictive performance.

Results: Our results show that a pre-trained DistilBERT with a linear layer trained on our proposed dataset can label EH bugs reasonably well, achieving ROC-AUC scores of up to 0.88. The combination of NLP and traditional ML techniques achieved ROC-AUC scores of up to 0.74 and recall of up to 0.56. As a sanity check, we also evaluated methods using representations extracted solely from keywords. Considering ROC-AUC as the primary concern, for the majority of the ML methods tested, the analysis suggests that keywords alone are not sufficient to characterize reports of EH bugs, although this can change depending on other metrics (such as recall and precision) or ML methods (e.g., Random Forest).

Conclusions: To the best of our knowledge, this is the first study addressing the problem of automatically labeling EH bugs. Based on our results, we conclude that the use of ML techniques, especially transformer-based models, appears promising for automating the task of labeling EH bugs. Overall, we hope (i) that our work will contribute towards raising awareness around EH bugs; and (ii) that our (publicly available) dataset will serve as a benchmark, paving the way for follow-up works. Additionally, our findings can be used to build tools that help maintainers identify EH bugs during the triage process.
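As a rough, hedged sketch of the classical NLP-plus-ML pipeline described above (TF-IDF features feeding a linear classifier), the snippet below uses scikit-learn with made-up example reports; it is illustrative only, not the paper's exact setup.

```python
# Illustrative TF-IDF + linear classifier pipeline; training data is made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = [
    "NullPointerException swallowed in catch block, job hangs",
    "UI button misaligned on small screens",
]
is_eh_bug = [1, 0]  # 1 = exception-handling bug, 0 = other

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(reports, is_eh_bug)
print(clf.predict(["empty catch block hides IOException"]))
```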
With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. In order to handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. In order to build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This property is not unique to text data, as it can be found in non-text (e.g., numeric) data as well; however, it is most prevalent in text data. This property also puts the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us perform better text classification, and under what circumstances. During classification, the high and sparse dimensionality of text data has also been considered. Although we are proposing and evaluating a text classification technique here, our main focus is on handling the multi-labelity of text data while utilizing the correlation among the multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide insight into how multi-labelity is handled in our classification process and show the effectiveness of our approach.
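To make the pseudo-label idea concrete, here is a minimal, generic sketch of a label-powerset transformation, in which each observed combination of class labels becomes one pseudo label; this illustrates only the underlying idea, not the pseudo-LSC algorithm itself.

```python
# Generic label-powerset sketch: each distinct observed combination of class
# labels becomes one pseudo label. Illustrative only; pseudo-LSC adds
# subspace clustering on top of this idea.
label_sets = [
    frozenset({"sports", "politics"}),
    frozenset({"sports"}),
    frozenset({"sports", "politics"}),
]

pseudo_ids = {}
pseudo_labels = [pseudo_ids.setdefault(s, len(pseudo_ids)) for s in label_sets]
print(pseudo_labels)  # [0, 1, 0] -- two distinct pseudo labels
```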
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This page contains a modified COCO dataset along with details about the dataset used.
File Descriptions
imgs.zip
- Train: 🚂 This folder contains the training set, which can be split into train/validation data for model training.
- Test: 🧪 Your trained models should be used to produce predictions on the test set.

labels.zip
- categories.csv: 📝 This file lists all the object classes in the dataset, ordered according to the column ordering in the train labels file.
- train_labels.csv: 📊 This file contains data regarding which image contains which categories.
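As a hedged sketch of how these files might be joined, the snippet below maps each training image to the class names it contains; the exact column layout of the CSVs is an assumption based on the descriptions above, not a documented schema.

```python
# Minimal sketch: align train_labels.csv with the class names in
# categories.csv. Column layout is assumed: first column of train_labels
# identifies the image, remaining columns are 0/1 flags per class.
import pandas as pd

categories = pd.read_csv("labels/categories.csv")      # one class per row
train_labels = pd.read_csv("labels/train_labels.csv")  # one row per image

class_names = categories.iloc[:, 0].tolist()
for _, row in train_labels.iterrows():
    present = [name for name, flag in zip(class_names, row.iloc[1:]) if flag == 1]
    print(row.iloc[0], present)
```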
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 2 rows and is filtered to the book "How to label a graph". It features 10 columns, including number of authors, number of books, earliest publication date, and latest publication date.
https://dataintelo.com/privacy-and-policy
The global data collection and labeling market size was USD 27.1 Billion in 2023 and is likely to reach USD 133.3 Billion by 2032, expanding at a CAGR of 22.4% during 2024–2032. The market growth is attributed to the increasing demand for high-quality labeled datasets to train artificial intelligence and machine learning algorithms across various industries.
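For readers who want to sanity-check that growth figure, a minimal sketch of the standard CAGR formula follows; it assumes eight annual compounding steps from 2024 to 2032, which is one common convention.

```python
# CAGR sanity check: (end / start) ** (1 / periods) - 1.
# Assumes eight compounding steps (2024 -> 2032); period conventions vary.
start, end, periods = 27.1, 133.3, 8
cagr = (end / start) ** (1 / periods) - 1
print(f"{cagr:.1%}")  # roughly 22%, consistent with the quoted 22.4%
```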
Growing adoption of AI in e-commerce is projected to drive the market in the assessment year. E-commerce platforms rely on high-quality images to showcase products effectively and improve the online shopping experience for customers. Accurately labeled images enable better product categorization and search optimization, driving higher conversion rates and customer engagement.
Rising adoption of AI in the financial sector is a significant factor boosting the need for data collection and labeling services for tasks such as fraud detection, risk assessment, and algorithmic trading. Financial institutions leverage labeled datasets to train AI models to analyze vast amounts of transactional data, identify patterns, and detect anomalies indicative of fraudulent activity.
The use of artificial intelligence is revolutionizing the way labeled datasets are created and utilized. With the advancements in AI technologies, such as computer vision and natural language processing, the demand for accurately labeled datasets has surged across various industries.
AI algorithms are increasingly being leveraged to automate and streamline the data labeling process, reducing the manual effort required and improving efficiency. For instance,
In April 2022, Encord, a startup, introduced its beta version of CordVision, an AI-assisted labeling application that inten
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RCV1
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is designed for training and evaluating object detection models, specifically for detecting plastic bottles and classifying them based on the presence or absence of a label. It is structured to work seamlessly with YOLOv8 and follows the standard YOLO format.
🔍 Classes:
- 0: Bottle with Label
- 1: Bottle without Label

📁 Folder Structure:
- images/: Contains all image files
- labels/: Corresponding YOLO-format annotation files
- data.yaml: Configuration file for training with YOLOv8
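As a hedged sketch of how this layout is typically consumed, the snippet below fine-tunes a YOLOv8 model with the Ultralytics package; the checkpoint name and epoch count are illustrative choices, not recommendations from the dataset.

```python
# Minimal YOLOv8 training sketch using the Ultralytics package.
# 'yolov8n.pt' and the epoch count are illustrative choices.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                # small pretrained checkpoint
model.train(data="data.yaml", epochs=50)  # data.yaml lists the two classes
results = model.val()                     # evaluate on the validation split
```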
🛠 Use Case: This dataset is ideal for real-time detection systems, quality control applications, recycling automation, and projects focused on object classification in cluttered or real-world environments.
penglingwei/follow-label-dataset-arrow dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset contains ILSVRC-2012 (ImageNet) validation images annotated with multi-class labels from "Evaluating Machine Accuracy on ImageNet", ICML, 2020. The multi-class labels were reviewed by a panel of experts extensively trained in the intricacies of fine-grained class distinctions in the ImageNet class hierarchy (see paper for more details). Compared to the original labels, these expert-reviewed multi-class labels enable a more semantically coherent evaluation of accuracy.
Version 3.0.0 of this dataset contains more corrected labels from "When does dough become a bagel? Analyzing the remaining mistakes on ImageNet", as well as the ImageNet-Major (ImageNet-M) 68-example split under 'imagenet-m'.
Only 20,000 of the 50,000 ImageNet validation images have multi-label annotations. The set of multi-labels was first generated by a testbed of 67 trained ImageNet models, and then each individual model prediction was manually annotated by the experts as either correct (the label is correct for the image), wrong (the label is incorrect for the image), or unclear (no consensus was reached among the experts).
Additionally, during annotation, the expert panel identified a set of problematic images; an image was flagged as problematic if it met any of several criteria (see the paper for details).
The problematic images are included in this dataset but should be ignored when computing multi-label accuracy. Additionally, since the initial set of 20,000 annotations is class-balanced, but the set of problematic images is not, we recommend computing the per-class accuracies and then averaging them. We also recommend counting a prediction as correct if it is marked as correct or unclear (i.e., being lenient with the unclear labels).
One possible way of doing this is with the following NumPy code:
```python
import tensorflow_datasets as tfds

ds = tfds.load('imagenet2012_multilabel', split='validation')

# We assume that predictions is a dictionary from file_name to a class index
# between 0 and 999.
num_correct_per_class = {}
num_images_per_class = {}

for example in ds:
    # We ignore all problematic images
    if example['is_problematic'].numpy():
        continue

    # The label of the image in ImageNet
    cur_class = example['original_label'].numpy()

    # If we haven't processed this class yet, set the counters to 0
    if cur_class not in num_correct_per_class:
        num_correct_per_class[cur_class] = 0
        assert cur_class not in num_images_per_class
        num_images_per_class[cur_class] = 0

    num_images_per_class[cur_class] += 1

    # Get the predictions for this image
    cur_pred = predictions[example['file_name'].numpy()]

    # We count a prediction as correct if it is marked as correct or unclear
    # (i.e., we are lenient with the unclear labels)
    if (cur_pred in example['correct_multi_labels'].numpy()
            or cur_pred in example['unclear_multi_labels'].numpy()):
        num_correct_per_class[cur_class] += 1

# Check that we have collected accuracy data for each of the 1,000 classes
num_classes = 1000
assert len(num_correct_per_class) == num_classes
assert len(num_images_per_class) == num_classes

# Compute the per-class accuracies and then average them
final_avg = 0
for cid in range(num_classes):
    assert cid in num_correct_per_class
    assert cid in num_images_per_class
    final_avg += num_correct_per_class[cid] / num_images_per_class[cid]
final_avg /= num_classes
```
To use this dataset:
```python
import tensorflow_datasets as tfds

# This dataset only provides a 'validation' split (see above).
ds = tfds.load('imagenet2012_multilabel', split='validation')
for ex in ds.take(4):
    print(ex)
```
See the guide for more information on tensorflow_datasets.
<img src="https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012_multilabel-3.0.0.png" alt="Visualization" width="500px">
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Multi-Label Web Page Classification Dataset
### Dataset Description
The Multi-Label Web Page Classification Dataset is a curated dataset containing web page titles and snippets, extracted from the CC-Meta25-1M dataset. Each entry has been automatically categorized into multiple predefined categories using ChatGPT-4o-mini. This dataset is designed for multi-label text classification tasks, making it ideal for training and evaluating machine learning models in web content… See the full description on the dataset page: https://huggingface.co/datasets/tshasan/multi-label-web-categorization.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
- **Retail Automation:** Implement the "labeling" computer vision model in retail stores to automatically identify and classify products on shelves for real-time inventory tracking, shelf management, and quick restocking.
- **Automated Checkout Systems:** Use the "labeling" model in cashier-less stores or self-checkout machines, allowing customers to simply place their products on a shelf or table for the system to recognize and process the items without scanning individual barcodes.
- **Product Recommendation System:** Integrate the "labeling" model into a recommendation engine, suggesting similar or complementary products based on shopping patterns, product relations, and the items currently in a customer's cart or hand.
- **Nutritional Information and Allergen Warnings:** By identifying specific snacks or food items, the "labeling" model can help users find nutritional information, ingredient lists, and potential allergen warnings for each product in real time through a mobile app or in-store display.
- **Product Recognition-based Marketing Campaigns:** Harness the computer vision model for interactive marketing campaigns, such as a treasure hunt in which participants find targeted products via image recognition, increasing customer engagement and brand awareness.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
subjective evaluation
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
## Overview
Label Data is a dataset for object detection tasks - it contains Waste QnrU annotations for 584 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [MIT license](https://opensource.org/licenses/MIT).