Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Tree Classification My Project is a dataset for classification tasks - it contains Trees Class annotations for 554 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Classify Project is a dataset for classification tasks - it contains Disease annotations for 675 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Trash Classification Project is a dataset for object detection tasks - it contains Trash annotations for 404 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Replication package related to the paper "Into the ML-universe: An Improved Classification and Characterization of Machine-Learning Projects" which includes the results of the various steps of our study with related plots, and the tool we built to classify our projects.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Classification Project is a dataset for classification tasks - it contains Cells annotations for 3,069 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Location Classification Dataset
Dataset Summary
This dataset contains images extracted from videos for scene classification into 4 categories:
Cafe, Gym, Library, Outdoor
Purpose
The dataset was created as part of a course project to perform location classification from an input image of the user's surroundings. The dataset represents real-world indoor and outdoor environments with varying lighting conditions, angles, and compositions.
Composition… See the full description on the dataset page: https://huggingface.co/datasets/madhavkarthi/project-1-location-classification-dataset.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains high-quality images of 5 different animal species: Cat, Cow, Lion, Deer, Dog — commonly used in beginner-level computer vision tasks such as image classification and model benchmarking.
Each image has been manually selected to ensure clarity and proper labeling.
The dataset is organized into three main folders:
animals_dataset/
├── train/
├── val/
└── test/
Each of these folders contains 5 subfolders (one for each class):
├── cat/
├── cow/
├── lion/
├── deer/
└── dog/
All images are in .jpg format and have been resized to a consistent shape (e.g., 224x224) for ease of use in deep learning models.
This dataset is ideal for beginner-level computer vision tasks such as image classification and model benchmarking; a minimal loading sketch follows.
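A minimal loading sketch under the folder layout above, using torchvision's ImageFolder; the root path is a placeholder and the batch size is arbitrary:
```
# Minimal sketch (assumption: the animals_dataset/ folder sits in the working directory).
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),  # images are already 224x224; resizing is defensive
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("animals_dataset/train", transform=transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

print(train_set.classes)  # ['cat', 'cow', 'deer', 'dog', 'lion'], assigned alphabetically
```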
All images in this dataset are sourced from publicly available internet sources for educational and non-commercial research purposes only. If you are the owner of any image and wish to request removal, please contact us.
This GitLab project contains the training data that was used for the metadata machine learning classification project.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
CS 4375 term project data compilation, labeled and converted to .csv
Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.
What Makes Our Data Unique?
Scale and Coverage:
- A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies.
- Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.
Rich Attributes for Training Models:
- Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights.
- Tailored for training models in NLP, recommendation systems, and predictive algorithms.
Compliance and Quality:
- Fully GDPR and CCPA compliant, providing secure and ethically sourced data.
- Extensive data cleaning and validation processes ensure reliability and accuracy.
Annotation-Ready:
- Pre-structured and formatted datasets that are easily ingestible into AI workflows.
- Ideal for supervised learning with tagging options such as entities, sentiment, or categories.
How Is the Data Sourced?
- Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques.
- Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets.
This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.
Primary Use Cases and Verticals
Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.
Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.
B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.
HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.
How This Product Fits Into Xverum’s Broader Data Offering Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.
Why Choose Xverum?
- Experience and Expertise: A trusted name in structured web data with a proven track record.
- Flexibility: Datasets can be tailored for any AI/ML application.
- Scalability: With 800M profiles and more being added, you’ll always have access to fresh, up-to-date data.
- Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.
Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.
Contact us for sample datasets or to discuss your specific needs.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The classification of variable objects provides insight into a wide variety of astrophysics ranging from stellar interiors to galactic nuclei. The Zwicky Transient Facility (ZTF) provides time series observations that record the variability of more than a billion sources. The scale of these data necessitates automated approaches to make a thorough analysis. Building on previous work, this paper reports the results of the ZTF Source Classification Project (SCoPe), which trains neural network and XGBoost machine learning (ML) algorithms to perform dichotomous classification of variable ZTF sources using a manually constructed training set containing 170,632 light curves. We find that several classifiers achieve high precision and recall scores, suggesting the reliability of their predictions for 373,819,334 light curves across 210 ZTF fields. We also identify the most important features for XGB classification and compare the performance of the two ML algorithms, finding a pattern of higher precision among XGB classifiers. The resulting classification catalog is available to the public, and the software developed for SCoPe is open-source and adaptable to future time-domain surveys.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
IOPA Classification Project is a dataset for classification tasks - it contains Objects annotations for 925 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
Usage
You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You need a working Julia installation. We used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
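For orientation, here is a minimal Python sketch of how an extracted data CSV and the provided index files might be combined to replicate one evaluation sample; the data file name is a placeholder, and the index-file layout (one sample per row) follows the description above:
```
# Minimal sketch (assumptions: "data.csv" stands for one of the four extracted CSV
# files; each row of app_val_indices.csv lists the data items of one sample).
import pandas as pd

data = pd.read_csv("data.csv")
indices = pd.read_csv("app_val_indices.csv", header=None)

# Replicate the first validation sample by drawing the specified data items.
sample = data.iloc[indices.iloc[0].dropna().astype(int)]

# The label distribution of this sample is the quantification target.
print(sample["class_label"].value_counts(normalize=True).sort_index())
```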
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification
This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.
Key Features:
LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.
Improved multi-label classification: Fine-tuning Homepage2Vec with these datasets has been shown to improve its macro F1 score from 38% to 43% when evaluated on a human-labeled dataset, demonstrating their effectiveness in capturing a broader range of website topics.
Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.
Dataset Composition:
curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5, context 2 and 1-shot
curlie-gpt4-10k: 10,000 websites labeled using GPT-4, context 2 and zero-shot
Intended Use:
Fine-tuning and advancing Homepage2Vec or similar website classification models
Research on LLM-generated datasets for text classification tasks
Exploration of multilingual website classification
Additional Information:
Project and report repository: https://github.com/CS-433/ml-project-2-mlp
Acknowledgments:
This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset comprises English sentences labeled with their corresponding tense categories. It is intended for use in natural language processing (NLP) and machine learning projects to classify the tense of English sentences. Each entry includes a sentence and a numerical label representing its tense.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for Code Comment Classification
Dataset Summary
The dataset contains class comments extracted from various large and diverse open-source projects written in three programming languages: Java, Smalltalk, and Python.
Supported Tasks and Leaderboards
Single-label text classification and Multi-label text classification
Languages
Java, Python, Smalltalk
Dataset Structure
Data Instances
{ "class" : "Absy.java", "comment":"*… See the full description on the dataset page: https://huggingface.co/datasets/poojaruhal/Code-comment-classification.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Fruit Classification and Freshness Detection Dataset
🔍 Overview
This dataset has been curated to support research and development in fruit classification and freshness detection using deep learning. It is designed for hybrid models that integrate YOLOv8 for real-time object detection with Convolutional Neural Networks (CNNs) for assessing fruit freshness. The dataset encompasses a diverse range of images captured under varying lighting conditions and angles, simulating real-world scenarios such as grocery stores, farms, and storage facilities.
The dataset comprises 8,099 high-resolution images of three commonly consumed fruits—apples, bananas, and oranges—each categorized into fresh and rotten conditions. Every image has been manually annotated in the YOLO format to aid object detection tasks and labeled for binary classification (Fresh/Rotten), enabling comprehensive model training.
📁 Dataset Structure
Total Images: 8,099
Training Set: 6,508 images (80%)
Test Set: 1,591 images (20%)
Classes (6 total):
Fresh Apples
Rotten Apples
Fresh Bananas
Rotten Bananas
Fresh Oranges
Rotten Oranges
Annotations: Provided in YOLO format using LabelImg
Image Format: JPG, resized to 300x300 pixels
Captured With: Smartphone camera under varied lighting and angles
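As a quick orientation for the detection side, the sketch below shows how a YOLOv8 model could be trained on this structure with the Ultralytics package; the dataset YAML file and hyperparameters are illustrative assumptions, not artifacts shipped with the dataset.
```
# Minimal sketch (assumptions: a fruit_data.yaml describing the train/test folders
# and the six class names has been written; epochs and model size are illustrative).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                                  # pretrained YOLOv8 nano weights
model.train(data="fruit_data.yaml", epochs=50, imgsz=300)   # 300x300 images, as noted above
metrics = model.val()                                       # reports mAP@0.5 and mAP@0.5:0.95
```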
🧠 Applications
This dataset is ideal for:
Object Detection using YOLOv8
Freshness classification using CNN
Hybrid models combining detection and classification
Computer vision projects in smart agriculture, food safety, and automated retail systems
📊 Sample Use Case
A hybrid deep learning model utilizing this dataset achieved:
Object Detection (YOLOv8):
mAP@0.5: 98%
mAP@0.5:0.95: 87%
Freshness Classification (CNN):
Test Accuracy: 97.6%
These results underscore the dataset’s suitability for high-performance, real-time AI applications in agricultural automation and food quality assessment.
👨💻 Contributors
Prof. Shubhashree Sahoo
Dr. Sitanath Biswas
Mr. Shubham Kumar Sah
Mr. Chirag Nahata
Special thanks to Dr. Soumobroto Saha and Prof. (Dr.) Partha Sarkar for their invaluable guidance and support throughout this research endeavor.
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains machine learning and volunteer classifications from the Gravity Spy project. It includes glitches from observing runs O1, O2, O3a and O3b that received at least one classification from a registered volunteer in the project. It also indicates glitches that are nominally retired from the project using our default set of retirement parameters, which are described below. See more details in the Gravity Spy Methods paper.
When a particular subject in a citizen science project (in this case, glitches from the LIGO datastream) is deemed to be classified sufficiently it is "retired" from the project. For the Gravity Spy project, retirement depends on a combination of both volunteer and machine learning classifications, and a number of parameterizations affect how quickly glitches get retired. For this dataset, we use a default set of retirement parameters, the most important of which are:
The choice of these and other parameterizations will affect the accuracy of the retired dataset as well as the number of glitches that are retired, and will be explored in detail in an upcoming publication (Zevin et al. in prep).
The dataset can be read in using e.g. Pandas:
```
import pandas as pd
dataset = pd.read_hdf('retired_fulldata_min2_max50_ret0p9.hdf5', key='image_db')
```
Each row in the dataframe contains information about a particular glitch in the Gravity Spy dataset.
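Because the full list of series is not reproduced here, a quick way to see what the dataframe provides is to inspect it directly, continuing the snippet above:
```
# Inspect the available series (columns) and the first few glitches.
print(dataset.columns.tolist())
print(dataset.head())
```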
Description of series in dataframe
For machine learning classifications on all glitches in O1, O2, O3a, and O3b, please see Gravity Spy Machine Learning Classifications on Zenodo
For the most recently uploaded training set used in Gravity Spy machine learning algorithms, please see Gravity Spy Training Set on Zenodo.
For detailed information on the training set used for the original Gravity Spy machine learning paper, please see Machine learning for Gravity Spy: Glitch classification and dataset on Zenodo.
AutoTrain Dataset for project: meme-classification
Dataset Description
This dataset has been automatically processed by AutoTrain for project meme-classification.
Languages
The BCP-47 code for the dataset's language is unk.
Dataset Structure
Data Instances
A sample from this dataset looks as follows: [ { "image": "<657x657 RGB PIL image>", "target": 1 }, { "image": "<1124x700 RGB PIL image>", "target": 0 }]… See the full description on the dataset page: https://huggingface.co/datasets/Hrishikesh332/autotrain-data-meme-classification.