100+ datasets found

m
AI & ML Training Data | Artificial Intelligence (AI) | Machine Learning (ML)...
apiscrapy.mydatastorefront.com
Updated Nov 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
APISCRAPY (2024). AI & ML Training Data | Artificial Intelligence (AI) | Machine Learning (ML) Datasets | Deep Learning Datasets | Easy to Integrate | Free Sample [Dataset]. https://apiscrapy.mydatastorefront.com/products/ai-ml-training-data-ai-learning-dataset-ml-learning-dataset-apiscrapy
Explore at:
Dataset updated
Nov 19, 2024
Dataset authored and provided by
APISCRAPY
Area covered
France, Canada, Switzerland, United Kingdom, Åland Islands, Monaco, Romania, Slovakia, Belgium, Japan
Description
APISCRAPY's AI & ML training data is meticulously curated and labelled to ensure the best quality. Our training data comes from a variety of areas, including healthcare and banking, as well as e-commerce and natural language processing.
A
Artificial Intelligence Training Dataset Report
datainsightsmarket.com
doc, pdf, ppt
Updated May 3, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data Insights Market (2025). Artificial Intelligence Training Dataset Report [Dataset]. https://www.datainsightsmarket.com/reports/artificial-intelligence-training-dataset-1958994
Explore at:
doc, ppt, pdfAvailable download formats
Dataset updated
May 3, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The global Artificial Intelligence (AI) Training Dataset market is experiencing robust growth, driven by the increasing adoption of AI across diverse sectors. The market's expansion is fueled by the burgeoning need for high-quality data to train sophisticated AI algorithms capable of powering applications like smart campuses, autonomous vehicles, and personalized healthcare solutions. The demand for diverse dataset types, including image classification, voice recognition, natural language processing, and object detection datasets, is a key factor contributing to market growth. While the exact market size in 2025 is unavailable, considering a conservative estimate of a $10 billion market in 2025 based on the growth trend and reported market sizes of related industries, and a projected CAGR (Compound Annual Growth Rate) of 25%, the market is poised for significant expansion in the coming years. Key players in this space are leveraging technological advancements and strategic partnerships to enhance data quality and expand their service offerings. Furthermore, the increasing availability of cloud-based data annotation and processing tools is further streamlining operations and making AI training datasets more accessible to businesses of all sizes. Growth is expected to be particularly strong in regions with burgeoning technological advancements and substantial digital infrastructure, such as North America and Asia Pacific. However, challenges such as data privacy concerns, the high cost of data annotation, and the scarcity of skilled professionals capable of handling complex datasets remain obstacles to broader market penetration. The ongoing evolution of AI technologies and the expanding applications of AI across multiple sectors will continue to shape the demand for AI training datasets, pushing this market toward higher growth trajectories in the coming years. The diversity of applications—from smart homes and medical diagnoses to advanced robotics and autonomous driving—creates significant opportunities for companies specializing in this market. Maintaining data quality, security, and ethical considerations will be crucial for future market leadership.
d
Data from: Training dataset for NABat Machine Learning V1.0
catalog.data.gov
data.usgs.gov
+1more
Updated Nov 26, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
Explore at:
Dataset updated
Nov 26, 2025
Dataset provided by
U.S. Geological Survey
Description
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm, however the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N =3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N =11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reach 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
TREC 2022 Deep Learning test collection
catalog.data.gov
gimi9.com
+1more
Updated May 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2023). TREC 2022 Deep Learning test collection [Dataset]. https://catalog.data.gov/dataset/trec-2022-deep-learning-test-collection
Explore at:
Dataset updated
May 9, 2023
Dataset provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
This is a test collection for passage and document retrieval, produced in the TREC 2023 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).Certain machine learning based methods, such as methods based on deep learning are known to require very large datasets for training. Lack of such large scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in the previous years aimed at providing large scale datasets to TREC, and create a focused research effort with a rigorous blind evaluation of ranker for the passage ranking and document ranking tasks.Similar to the previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought in to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
O
BUTTER - Empirical Deep Learning Dataset
data.openei.org
datasets.ai
+2more
code, data, website
Updated May 20, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Charles Tripp; Jordan Perr-Sauer; Lucas Hayne; Monte Lunacek; Charles Tripp; Jordan Perr-Sauer; Lucas Hayne; Monte Lunacek (2022). BUTTER - Empirical Deep Learning Dataset [Dataset]. http://doi.org/10.25984/1872441
Explore at:
code, website, dataAvailable download formats
Unique identifier
https://doi.org/10.25984/1872441
Dataset updated
May 20, 2022
Dataset provided by
National Renewable Energy Laboratory
USDOE Office of Energy Efficiency and Renewable Energy (EERE), Multiple Programs (EE)
Open Energy Data Initiative (OEDI)
Authors
Charles Tripp; Jordan Perr-Sauer; Lucas Hayne; Monte Lunacek; Charles Tripp; Jordan Perr-Sauer; Lucas Hayne; Monte Lunacek
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The BUTTER Empirical Deep Learning Dataset represents an empirical study of the deep learning phenomena on dense fully connected networks, scanning across thirteen datasets, eight network shapes, fourteen depths, twenty-three network sizes (number of trainable parameters), four learning rates, six minibatch sizes, four levels of label noise, and fourteen levels of L1 and L2 regularization each. Multiple repetitions (typically 30, sometimes 10) of each combination of hyperparameters were preformed, and statistics including training and test loss (using a 80% / 20% shuffled train-test split) are recorded at the end of each training epoch. In total, this dataset covers 178 thousand distinct hyperparameter settings ("experiments"), 3.55 million individual training runs (an average of 20 repetitions of each experiments), and a total of 13.3 billion training epochs (three thousand epochs were covered by most runs). Accumulating this dataset consumed 5,448.4 CPU core-years, 17.8 GPU-years, and 111.2 node-years.
Open Source Pre-trained Models Dataset
kaggle.com
zip
Updated Mar 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yasir Raza (2023). Open Source Pre-trained Models Dataset [Dataset]. https://www.kaggle.com/datasets/yasirabdaali/open-source-pre-trained-models-dataset
Explore at:
zip(10923 bytes)Available download formats
Dataset updated
Mar 1, 2023
Authors
Yasir Raza
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains information about various open-source pre-trained models that are available on Kaggle. These models can be used for various machine learning and deep learning tasks such as image classification, natural language processing, object detection, etc. The dataset has the following features:

url: The link to the model page on Kaggle

title: The name of the model

description: A brief overview of the model.

model_details: organization, number of variations of the model, etc. architecture, input/output format, parameters, etc.

votes: The number of upvotes that the model has received from the Kaggle community.

The dataset can be useful for anyone who wants to explore different pre-trained models and compare their performance and features. It can also help in finding suitable models for specific problems or domains.
Datasets
figshare.com
zip
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bastian Eichenberger; YinXiu Zhan (2023). Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.12958037.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12958037.v1
Dataset updated
May 31, 2023
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Bastian Eichenberger; YinXiu Zhan
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
The benchmarking datasets used for deepBlink. The npz files contain train/valid/test splits inside and can be used directly. The files belong to the following challenges / classes:- ISBI Particle tracking challenge: microtubule, vesicle, receptor- Custom synthetic (based on http://smal.ws): particle- Custom fixed cell: smfish- Custom live cell: suntagThe csv files are to determine which image in the test splits correspond to which original image, SNR, and density.
Machine Learning Dataset
brightdata.com
.json, .csv, .xlsx
Updated Dec 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
Explore at:
.json, .csv, .xlsxAvailable download formats
Dataset updated
Dec 23, 2024
Dataset authored and provided by
Bright Datahttps://brightdata.com/
License
https://brightdata.com/licensehttps://brightdata.com/license
Area covered
Worldwide
Description
Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.
m
LOCBEEF: Beef Quality Image dataset for Deep Learning Models
data.mendeley.com
Updated Nov 30, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tri Mulya Dharma (2022). LOCBEEF: Beef Quality Image dataset for Deep Learning Models [Dataset]. http://doi.org/10.17632/nhs6mjg6yy.1
Explore at:
Unique identifier
https://doi.org/10.17632/nhs6mjg6yy.1
Dataset updated
Nov 30, 2022
Authors
Tri Mulya Dharma
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The LOCBEEF dataset contains 3268 images of local Aceh beef collected from 07:00 a.m - 22:00 p.m, more information about the clock is shown in Fig. The dataset contains two categories of directories, namely train, and test. Furthermore, each subdirectory consists of fresh and rotten. An example of the image can be seen in Figs. 2 and 3. The directory structure for the data is shown in Fig. 1. The image directory for train contains 2228 images each subdirectory contains 1114 images, and the test directory contains 980 images for each subdirectory containing 490 images. For images have a resolution of 176 x 144 pixel, 320 x 240 pixel, 640 x 480 pixel, 720 x 480 pixel, 720 x 720 pixel, 1280 x 720 pixel, 1920 x 1080 pixel, 2560 x 1920 pixel, 3120 x 3120 pixel, 3264 x 2248 pixel, and 4160 x 3120 pixel.

The classification of LOCBEEF datasets has been carried out using the deep learning method of Convolutional Neural Networks with an image composition of 70% training data and 30% test data. Images with the mentioned dimensions are included in the LOCBEEF dataset to apply to the Resnet50.
CSIRO Sentinel-1 SAR image dataset of oil- and non-oil features for machine...
data.csiro.au
researchdata.edu.au
Updated Dec 15, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David Blondeau-Patissier; Thomas Schroeder; Foivos Diakogiannis; Zhibin Li (2022). CSIRO Sentinel-1 SAR image dataset of oil- and non-oil features for machine learning ( Deep Learning ) [Dataset]. http://doi.org/10.25919/4v55-dn16
Explore at:
Unique identifier
https://doi.org/10.25919/4v55-dn16
Dataset updated
Dec 15, 2022
Dataset provided by
CSIROhttp://www.csiro.au/
Authors
David Blondeau-Patissier; Thomas Schroeder; Foivos Diakogiannis; Zhibin Li
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Time period covered
May 1, 2015 - Aug 31, 2022
Area covered

Dataset funded by
CSIROhttp://www.csiro.au/
ESA
Description
What this collection is: A curated, binary-classified image dataset of grayscale (1 band) 400 x 400-pixel size, or image chips, in a JPEG format extracted from processed Sentinel-1 Synthetic Aperture Radar (SAR) satellite scenes acquired over various regions of the world, and featuring clear open ocean chips, look-alikes (wind or biogenic features) and oil slick chips.

This binary dataset contains chips labelled as: - "0" for chips not containing any oil features (look-alikes or clean seas)
- "1" for those containing oil features.

This binary dataset is imbalanced, and biased towards "0" labelled chips (i.e., no oil features), which correspond to 66% of the dataset. Chips containing oil features, labelled "1", correspond to 34% of the dataset.

Why: This dataset can be used for training, validation and/or testing of machine learning, including deep learning, algorithms for the detection of oil features in SAR imagery. Directly applicable for algorithm development for the European Space Agency Sentinel-1 SAR mission (https://sentinel.esa.int/web/sentinel/missions/sentinel-1 ), it may be suitable for the development of detection algorithms for other SAR satellite sensors.

Overview of this dataset: Total number of chips (both classes) is N=5,630 Class 0 1 Total 3,725 1,905

Further information and description is found in the ReadMe file provided (ReadMe_Sentinel1_SAR_OilNoOil_20221215.txt)
d
80K+ Construction Site Images | AI Training Data | Machine Learning (ML)...
datarade.ai
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data Seeds, 80K+ Construction Site Images | AI Training Data | Machine Learning (ML) data | Object & Scene Detection | Global Coverage [Dataset]. https://datarade.ai/data-products/50k-construction-site-images-ai-training-data-machine-le-data-seeds
Explore at:
.bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
Dataset authored and provided by
Data Seeds
Area covered
Russian Federation, Senegal, Guatemala, United Arab Emirates, Tunisia, Swaziland, Grenada, Peru, Venezuela (Bolivarian Republic of), Kenya
Description
This dataset features over 80,000 high-quality images of construction sites sourced from photographers worldwide. Built to support AI and machine learning applications, it delivers richly annotated and visually diverse imagery capturing real-world construction environments, machinery, and processes.

Key Features: 1. Comprehensive Metadata: the dataset includes full EXIF data such as aperture, ISO, shutter speed, and focal length. Each image is annotated with construction phase, equipment types, safety indicators, and human activity context—making it ideal for object detection, site monitoring, and workflow analysis. Popularity metrics based on performance on our proprietary platform are also included.

Unique Sourcing Capabilities: images are collected through a proprietary gamified platform, with competitions focused on industrial, construction, and labor themes. Custom datasets can be generated within 72 hours to target specific scenarios, such as building types, stages (excavation, framing, finishing), regions, or safety compliance visuals.

Global Diversity: sourced from contributors in over 100 countries, the dataset reflects a wide range of construction practices, materials, climates, and regulatory environments. It includes residential, commercial, industrial, and infrastructure projects from both urban and rural areas.

High-Quality Imagery: includes a mix of wide-angle site overviews, close-ups of tools and equipment, drone shots, and candid human activity. Resolution varies from standard to ultra-high-definition, supporting both macro and contextual analysis.

Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. These scores provide insight into visual clarity, engagement value, and human interest—useful for safety-focused or user-facing AI models.

AI-Ready Design: this dataset is structured for training models in real-time object detection (e.g., helmets, machinery), construction progress tracking, material identification, and safety compliance. It’s compatible with standard ML frameworks used in construction tech.

Licensing & Compliance: fully compliant with privacy, labor, and workplace imagery regulations. Licensing is transparent and ready for commercial or research deployment.

Use Cases: 1. Training AI for safety compliance monitoring and PPE detection. 2. Powering progress tracking and material usage analysis tools. 3. Supporting site mapping, autonomous machinery, and smart construction platforms. 4. Enhancing augmented reality overlays and digital twin models for construction planning.

This dataset provides a comprehensive, real-world foundation for AI innovation in construction technology, safety, and operational efficiency. Custom datasets are available on request. Contact us to learn more!
Colour-Greyscale Dataset
kaggle.com
zip
Updated Jul 8, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rohan Rao Eravelli (2024). Colour-Greyscale Dataset [Dataset]. https://www.kaggle.com/datasets/rohanraoeravelli/colour-greyscale-dataset
Explore at:
zip(194671456 bytes)Available download formats
Dataset updated
Jul 8, 2024
Authors
Rohan Rao Eravelli
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset provides a valuable resource for exploring the fundamentals of color grading models in deep learning. Here's an expanded description highlighting its key features and potential applications:

Structure:

Two main folders: Cars and Flowers Each folder contains 400 images (200 color + 200 grayscale) This balanced representation facilitates model training and evaluation on both color and grayscale data. Applications:

Learning Color Grading Concepts: The dataset's simplicity allows beginners to grasp the core principles of color grading models. By training models to transform grayscale images to their colored counterparts (for Cars and Flowers) and vice versa, users can understand how these models learn the relationships between color and grayscale representations. Experimentation with Model Architectures: The dataset's size is suitable for testing and comparing different deep learning architectures for color grading tasks. This exploration can help identify efficient models that achieve good results on a manageable dataset. Fine-tuning Pre-trained Models: This dataset can be used for fine-tuning pre-trained models like convolutional neural networks (CNNs) that have already learned general image processing features. Fine-tuning leverages these pre-trained weights and focuses on color-specific relationships within the Cars and Flowers domain. Benchmarking Performance: The dataset can serve as a benchmark for evaluating the performance of new color grading models. By comparing the accuracy of different models in converting grayscale images to their color counterparts, researchers can track progress in the field.
A
Artificial Intelligence Training Dataset Report
archivemarketresearch.com
doc, pdf, ppt
Updated Feb 21, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Archive Market Research (2025). Artificial Intelligence Training Dataset Report [Dataset]. https://www.archivemarketresearch.com/reports/artificial-intelligence-training-dataset-38645
Explore at:
pdf, ppt, docAvailable download formats
Dataset updated
Feb 21, 2025
Dataset authored and provided by
Archive Market Research
License
https://www.archivemarketresearch.com/privacy-policyhttps://www.archivemarketresearch.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The global Artificial Intelligence (AI) Training Dataset market is projected to reach $1605.2 million by 2033, exhibiting a CAGR of 9.4% from 2025 to 2033. The surge in demand for AI training datasets is driven by the increasing adoption of AI and machine learning technologies in various industries such as healthcare, financial services, and manufacturing. Moreover, the growing need for reliable and high-quality data for training AI models is further fueling the market growth. Key market trends include the increasing adoption of cloud-based AI training datasets, the emergence of synthetic data generation, and the growing focus on data privacy and security. The market is segmented by type (image classification dataset, voice recognition dataset, natural language processing dataset, object detection dataset, and others) and application (smart campus, smart medical, autopilot, smart home, and others). North America is the largest regional market, followed by Europe and Asia Pacific. Key companies operating in the market include Appen, Speechocean, TELUS International, Summa Linguae Technologies, and Scale AI. Artificial Intelligence (AI) training datasets are critical for developing and deploying AI models. These datasets provide the data that AI models need to learn, and the quality of the data directly impacts the performance of the model. The AI training dataset market landscape is complex, with many different providers offering datasets for a variety of applications. The market is also rapidly evolving, as new technologies and techniques are developed for collecting, labeling, and managing AI training data.
a
MNIST
datasets.activeloop.ai
deeplake
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yann LeCun, MNIST [Dataset]. https://datasets.activeloop.ai/docs/ml/datasets/mnist/
Explore at:
deeplakeAvailable download formats
Authors
Yann LeCun
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Time period covered
Jan 1, 1998 - Dec 31, 2000
Area covered
Earth
Dataset funded by
AT&T Bell Labs
Description
The MNIST dataset is a dataset of handwritten digits. It is a popular dataset for machine learning and artificial intelligence research. The dataset consists of 60,000 training images and 10,000 test images. Each image is a 28x28 pixel grayscale image of a handwritten digit. The digits are labeled from 0 to 9.
m
Data from: SalmonScan: A Novel Image Dataset for Machine Learning and Deep...
data.mendeley.com
Updated Apr 2, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Md Shoaib Ahmed (2024). SalmonScan: A Novel Image Dataset for Machine Learning and Deep Learning Analysis in Fish Disease Detection in Aquaculture [Dataset]. http://doi.org/10.17632/x3fz2nfm4w.3
Explore at:
Unique identifier
https://doi.org/10.17632/x3fz2nfm4w.3
Dataset updated
Apr 2, 2024
Authors
Md Shoaib Ahmed
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The SalmonScan dataset is a collection of images of salmon fish, including healthy fish and infected fish. The dataset consists of two classes of images:

Fresh salmon 🐟 Infected Salmon 🐠

This dataset is ideal for various computer vision tasks in machine learning and deep learning applications. Whether you are a researcher, developer, or student, the SalmonScan dataset offers a rich and diverse data source to support your projects and experiments.

So, dive in and explore the fascinating world of salmon health and disease!

The SalmonScan dataset (raw) consists of 24 fresh fish and 91 infected fish. [Due to server cleaning in the past, some raw datasets have been deleted]

The SalmonScan dataset (augmented) consists of approximately 1,208 images of salmon fish, classified into two classes:

Fresh salmon (healthy fish with no visible signs of disease), 456 images

Infected Salmon containing disease, 752 images

Each class contains a representative and diverse collection of images, capturing a range of different perspectives, scales, and lighting conditions. The images have been carefully curated to ensure that they are of high quality and suitable for use in a variety of computer vision tasks.

Data Preprocessing

The input images were preprocessed to enhance their quality and suitability for further analysis. The following steps were taken:

Resizing 📏: All the images were resized to a uniform size of 600 pixels in width and 250 pixels in height to ensure compatibility with the learning algorithm. Image Augmentation 📸: To overcome the small amount of images, various image augmentation techniques were applied to the input images. These included: Horizontal Flip ↩️: The images were horizontally flipped to create additional samples. Vertical Flip ⬆️: The images were vertically flipped to create additional samples. Rotation 🔄: The images were rotated to create additional samples. Cropping 🪓: A portion of the image was randomly cropped to create additional samples. Gaussian Noise 🌌: Gaussian noise was added to the images to create additional samples. Shearing 🌆: The images were sheared to create additional samples. Contrast Adjustment (Gamma) ⚖️: The gamma correction was applied to the images to adjust their contrast. Contrast Adjustment (Sigmoid) ⚖️: The sigmoid function was applied to the images to adjust their contrast.

Usage

To use the salmon scan dataset in your ML and DL projects, follow these steps:

Clone or download the salmon scan dataset repository from GitHub.

Use standard libraries such as numpy or pandas to convert the images into arrays, which can be input into a machine learning or deep learning model.

Split the dataset into training, validation, and test sets as per your requirement.

Preprocess the data as needed, such as resizing and normalizing the images.

Train your ML/DL model using the preprocessed training data.

Evaluate the model on the test set and make predictions on new, unseen data.
BUTTER-E - Energy Consumption Data for the BUTTER Empirical Deep Learning...
data.openei.org
datasets.ai
+2more
archive, code, data +1
Updated Dec 30, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Charles Tripp; Jordan Perr-Sauer; Erik Bensen; Jamil Gafur; Ambarish Nag; Avi Purkayastha; Charles Tripp; Jordan Perr-Sauer; Erik Bensen; Jamil Gafur; Ambarish Nag; Avi Purkayastha (2022). BUTTER-E - Energy Consumption Data for the BUTTER Empirical Deep Learning Dataset [Dataset]. http://doi.org/10.25984/2329316
Explore at:
code, data, archive, websiteAvailable download formats
Unique identifier
https://doi.org/10.25984/2329316
Dataset updated
Dec 30, 2022
Dataset provided by
United States Department of Energyhttp://energy.gov/
National Renewable Energy Laboratory
Open Energy Data Initiative (OEDI)
Authors
Charles Tripp; Jordan Perr-Sauer; Erik Bensen; Jamil Gafur; Ambarish Nag; Avi Purkayastha; Charles Tripp; Jordan Perr-Sauer; Erik Bensen; Jamil Gafur; Ambarish Nag; Avi Purkayastha
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The BUTTER-E - Energy Consumption Data for the BUTTER Empirical Deep Learning Dataset adds node-level energy consumption data from watt-meters to the primary sweep of the BUTTER - Empirical Deep Learning Dataset. This dataset contains energy consumption and performance data from 63,527 individual experimental runs spanning 30,582 distinct configurations: 13 datasets, 20 sizes (number of trainable parameters), 8 network "shapes", and 14 depths on both CPU and GPU hardware collected using node-level watt-meters. This dataset reveals the complex relationship between dataset size, network structure, and energy use, and highlights the impact of cache effects.

BUTTER-E is intended to be joined with the BUTTER dataset (see "BUTTER - Empirical Deep Learning Dataset on OEDI" resource below) which characterizes the performance of 483k distinct fully connected neural networks but does not include energy measurements.
e
power tower training dataset for deep learning model - Dataset -...
energydata.info
Updated Aug 30, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). power tower training dataset for deep learning model - Dataset - ENERGYDATA.INFO [Dataset]. https://energydata.info/dataset/power-tower-training-dataset-for-deep-learning-model
Explore at:
Dataset updated
Aug 30, 2024
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This is the training dataset for power tower deep learning model development. The dataset contains 50cm resolution Mapbox image tiles (Maxar imagery) as well as the power tower location presence in the imagery as geojson file. Both the geographic coordinates and the pixel coordinates of the power towers have been incorporated. The dataset covers pilot areas in west coast of Liberia, Yemen and India.
i
Train Image Classification Dataset
images.cv
zip
Updated Nov 27, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). Train Image Classification Dataset [Dataset]. https://images.cv/dataset/train-image-classification-dataset
Explore at:
zipAvailable download formats
Dataset updated
Nov 27, 2025
License
https://images.cv/licensehttps://images.cv/license
Description
Labeled Train images suitable for training and evaluating computer vision and deep learning models.
m
Encrypted Traffic Feature Dataset for Machine Learning and Deep Learning...
data.mendeley.com
Updated Dec 6, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zihao Wang (2022). Encrypted Traffic Feature Dataset for Machine Learning and Deep Learning based Encrypted Traffic Analysis [Dataset]. http://doi.org/10.17632/xw7r4tt54g.1
Explore at:
Unique identifier
https://doi.org/10.17632/xw7r4tt54g.1
Dataset updated
Dec 6, 2022
Authors
Zihao Wang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This traffic dataset contains a balance size of encrypted malicious and legitimate traffic for encrypted malicious traffic detection and analysis. The dataset is a secondary csv feature data that is composed of six public traffic datasets.

Our dataset is curated based on two criteria: The first criterion is to combine widely considered public datasets which contain enough encrypted malicious or encrypted legitimate traffic in existing works, such as Malware Capture Facility Project datasets. The second criterion is to ensure the final dataset balance of encrypted malicious and legitimate network traffic.

Based on the criteria, 6 public datasets are selected. After data pre-processing, details of each selected public dataset and the size of different encrypted traffic are shown in the “Dataset Statistic Analysis Document”. The document summarized the malicious and legitimate traffic size we selected from each selected public dataset, the traffic size of each malicious traffic type, and the total traffic size of the composed dataset. From the table, we are able to observe that encrypted malicious and legitimate traffic equally contributes to approximately 50% of the final composed dataset.

The datasets now made available were prepared to aim at encrypted malicious traffic detection. Since the dataset is used for machine learning or deep learning model training, a sample of train and test sets are also provided. The train and test datasets are separated based on 1:4. Such datasets can be used for machine learning or deep learning model training and testing based on selected features or after processing further data pre-processing.
t
Training dataset for deep learning surrogate model - Dataset - LDM
service.tib.eu
Updated Dec 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Training dataset for deep learning surrogate model - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/training-dataset-for-deep-learning-surrogate-model
Explore at:
Dataset updated
Dec 16, 2024
Description
A dataset of 500 groups of well control sequences for the producers and injectors, and the production data sequences solved with the numerical simulator.

Facebook

Twitter

Click to copy link

Link copied

Cite

APISCRAPY (2024). AI & ML Training Data | Artificial Intelligence (AI) | Machine Learning (ML) Datasets | Deep Learning Datasets | Easy to Integrate | Free Sample [Dataset]. https://apiscrapy.mydatastorefront.com/products/ai-ml-training-data-ai-learning-dataset-ml-learning-dataset-apiscrapy

AI & ML Training Data | Artificial Intelligence (AI) | Machine Learning (ML) Datasets | Deep Learning Datasets | Easy to Integrate | Free Sample

Explore at:

Dataset updated

Nov 19, 2024

Dataset authored and provided by

APISCRAPY

Area covered

France, Canada, Switzerland, United Kingdom, Åland Islands, Monaco, Romania, Slovakia, Belgium, Japan

Description

APISCRAPY's AI & ML training data is meticulously curated and labelled to ensure the best quality. Our training data comes from a variety of areas, including healthcare and banking, as well as e-commerce and natural language processing.

Clear search

Close search

Google apps

Main menu

AI & ML Training Data | Artificial Intelligence (AI) | Machine Learning (ML)...

Artificial Intelligence Training Dataset Report

Data from: Training dataset for NABat Machine Learning V1.0

TREC 2022 Deep Learning test collection

BUTTER - Empirical Deep Learning Dataset

Open Source Pre-trained Models Dataset

Datasets

Machine Learning Dataset

LOCBEEF: Beef Quality Image dataset for Deep Learning Models

CSIRO Sentinel-1 SAR image dataset of oil- and non-oil features for machine...

80K+ Construction Site Images | AI Training Data | Machine Learning (ML)...

Colour-Greyscale Dataset

Artificial Intelligence Training Dataset Report

MNIST

Data from: SalmonScan: A Novel Image Dataset for Machine Learning and Deep...

BUTTER-E - Energy Consumption Data for the BUTTER Empirical Deep Learning...

power tower training dataset for deep learning model - Dataset -...

Train Image Classification Dataset

Encrypted Traffic Feature Dataset for Machine Learning and Deep Learning...

Training dataset for deep learning surrogate model - Dataset - LDM

AI & ML Training Data | Artificial Intelligence (AI) | Machine Learning (ML) Datasets | Deep Learning Datasets | Easy to Integrate | Free Sample