100+ datasets found

ML Datasets
kaggle.com
Updated May 1, 2023
Cite
Bikram Saha (2023). ML Datasets [Dataset]. https://www.kaggle.com/datasets/imbikramsaha/ml-datasets/data
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 1, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Bikram Saha
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The dataset contains a diverse range of examples, including classification, regression, clustering, and dimensionality reduction problems, with varying levels of complexity and numbers of features. Each dataset comes with a detailed description of the problem and the corresponding features, making it easy to understand and work with. It also lets machine learning enthusiasts experiment with different scikit-learn algorithms and evaluate their performance across datasets. It suits both beginners and advanced practitioners looking to hone their skills in various machine learning techniques.
Machine Learning Dataset
brightdata.com
.json, .csv, .xlsx
Updated Dec 23, 2024
Cite
Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
Explore at:
Available download formats: .json, .csv, .xlsx
Dataset updated
Dec 23, 2024
Dataset authored and provided by
Bright Data (https://brightdata.com/)
License
https://brightdata.com/license
Area covered
Worldwide
Description
Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.
ai training dataset Report
datainsightsmarket.com
doc, pdf, ppt
Updated May 10, 2025
+ more versions
Cite
Data Insights Market (2025). ai training dataset Report [Dataset]. https://www.datainsightsmarket.com/reports/ai-training-dataset-1502524
Explore at:
Available download formats: doc, pdf, ppt
Dataset updated
May 10, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
CA
Variables measured
Market Size
Description
The AI training dataset market is experiencing robust growth, driven by the increasing adoption of artificial intelligence across diverse sectors. The market's expansion is fueled by the need for high-quality, labeled data to train sophisticated AI models capable of handling complex tasks. Applications span various industries, including IT, automotive, healthcare, BFSI (Banking, Financial Services, and Insurance), and retail & e-commerce. The demand for diverse data types—text, image/video, and audio—further fuels market expansion. While precise market sizing is unavailable, considering the rapid growth of AI and the significant investment in data annotation services, a reasonable estimate places the 2025 market value at approximately $15 billion, with a compound annual growth rate (CAGR) of 25% projected through 2033. This growth reflects a rising awareness of the pivotal role high-quality datasets play in achieving accurate and reliable AI outcomes. Key restraining factors include the high cost of data acquisition and annotation, along with concerns around data privacy and security. However, these challenges are being addressed through advancements in automation and the emergence of innovative data synthesis techniques. The competitive landscape is characterized by a mix of established technology giants like Google, Amazon, and Microsoft, alongside specialized data annotation companies like Appen and Lionbridge. The market is expected to see continued consolidation as larger players acquire smaller firms to expand their data offerings and strengthen their market position. Regional variations exist, with North America and Europe currently dominating the market share, although regions like Asia-Pacific are projected to experience significant growth due to increasing AI adoption and investments.
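
To make the compounding in this description concrete, the following is a minimal sketch that projects the quoted 2025 estimate forward at the quoted growth rate. The function name `project_value` and the choice of eight compounding periods (2025 to 2033) are assumptions for illustration; the report itself does not state a 2033 figure.

```python
def project_value(base_value: float, cagr: float, years: int) -> float:
    """Compound base_value forward by `years` annual periods at rate `cagr`."""
    return base_value * (1.0 + cagr) ** years


# Figures quoted in the description: about $15 billion in 2025, 25% CAGR.
BASE_2025_USD_BN = 15.0
CAGR = 0.25
YEARS = 2033 - 2025  # assumption: eight full compounding periods

projected_2033 = project_value(BASE_2025_USD_BN, CAGR, YEARS)
print(f"Implied 2033 value: about ${projected_2033:.1f} billion")
```

Under these assumptions the implied 2033 value is roughly six times the 2025 estimate, which shows how sensitive such forecasts are to the assumed rate.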
Data set for various metal types
ieee-dataport.org
Updated Jun 25, 2020
Cite
RADHAMADHAB DALAI (2020). Data set for various metal types [Dataset]. https://ieee-dataport.org/open-access/data-set-various-metal-types
Explore at:
Dataset updated
Jun 25, 2020
Authors
RADHAMADHAB DALAI
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The image files are scaled and modified to form a training dataset. It can be used to detect and identify object type based on the material type in the image. Both training and test datasets can be generated from these image files.
AI Training Dataset Market Report | Global Forecast From 2025 To 2033
dataintelo.com
csv, pdf, pptx
Updated Jan 7, 2025
Cite
Dataintelo (2025). AI Training Dataset Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-ai-training-dataset-market
Explore at:
Available download formats: csv, pptx, pdf
Dataset updated
Jan 7, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
AI Training Dataset Market Outlook

The global AI training dataset market size was valued at approximately USD 1.2 billion in 2023 and is projected to reach USD 6.5 billion by 2032, growing at a compound annual growth rate (CAGR) of 20.5% from 2024 to 2032. This substantial growth is driven by the increasing adoption of artificial intelligence across various industries, the necessity for large-scale and high-quality datasets to train AI models, and the ongoing advancements in AI and machine learning technologies.
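
The growth rate quoted above can be sanity-checked from the two endpoint values. This is a minimal sketch, assuming nine compounding periods between the 2023 valuation and the 2032 projection; `implied_cagr` is a hypothetical helper name, not something from the report.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Return the constant annual growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Endpoints quoted in the outlook: USD 1.2 billion (2023) and USD 6.5 billion (2032).
START_2023_USD_BN = 1.2
END_2032_USD_BN = 6.5
PERIODS = 2032 - 2023  # assumption: nine annual compounding periods

rate = implied_cagr(START_2023_USD_BN, END_2032_USD_BN, PERIODS)
print(f"Implied CAGR: {rate:.2%}")
```

With nine periods the implied rate lands close to the reported 20.5%, so the quoted figures are internally consistent under that reading.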

One of the primary growth factors in the AI training dataset market is the exponential increase in data generation across multiple sectors. With the proliferation of internet usage, the expansion of IoT devices, and the digitalization of industries, there is an unprecedented volume of data being generated daily. This data is invaluable for training AI models, enabling them to learn and make more accurate predictions and decisions. Moreover, the need for diverse and comprehensive datasets to improve AI accuracy and reliability is further propelling market growth.

Another significant factor driving the market is the rising investment in AI and machine learning by both public and private sectors. Governments around the world are recognizing the potential of AI to transform economies and improve public services, leading to increased funding for AI research and development. Simultaneously, private enterprises are investing heavily in AI technologies to gain a competitive edge, enhance operational efficiency, and innovate new products and services. These investments necessitate high-quality training datasets, thereby boosting the market.

The proliferation of AI applications in various industries, such as healthcare, automotive, retail, and finance, is also a major contributor to the growth of the AI training dataset market. In healthcare, AI is being used for predictive analytics, personalized medicine, and diagnostic automation, all of which require extensive datasets for training. The automotive industry leverages AI for autonomous driving and vehicle safety systems, while the retail sector uses AI for personalized shopping experiences and inventory management. In finance, AI assists in fraud detection and risk management. The diverse applications across these sectors underline the critical need for robust AI training datasets.

As the demand for AI applications continues to grow, the role of AI data resource services becomes increasingly vital. These services provide the infrastructure and tools to manage, curate, and distribute datasets efficiently. By using them, organizations can ensure that their AI models are trained on high-quality and relevant data, which is crucial for achieving accurate and reliable outcomes. Such a service acts as a bridge between raw data and AI applications, streamlining data acquisition, annotation, and validation. This not only enhances the performance of AI systems but also accelerates the development cycle, enabling faster deployment of AI-driven solutions across various sectors.

Regionally, North America currently dominates the AI training dataset market due to the presence of major technology companies and extensive R&D activities in the region. However, Asia Pacific is expected to witness the highest growth rate during the forecast period, driven by rapid technological advancements, increasing investments in AI, and the growing adoption of AI technologies across various industries in countries like China, India, and Japan. Europe and Latin America are also anticipated to experience significant growth, supported by favorable government policies and the increasing use of AI in various sectors.

Data Type Analysis

The data type segment of the AI training dataset market encompasses text, image, audio, video, and others. Each data type plays a crucial role in training different types of AI models, and the demand for specific data types varies based on the application. Text data is extensively used in natural language processing (NLP) applications such as chatbots, sentiment analysis, and language translation. As the use of NLP is becoming more widespread, the demand for high-quality text datasets is continually rising. Companies are investing in curated text datasets that encompass diverse languages and dialects to improve the accuracy and efficiency of NLP models.

Image data is critical for computer vision application
Data_Sheet_1_Deep Learning in Alzheimer's Disease: Diagnostic Classification...
frontiersin.figshare.com
pdf
Updated May 30, 2023
Cite
Taeho Jo; Kwangsik Nho; Andrew J. Saykin (2023). Data_Sheet_1_Deep Learning in Alzheimer's Disease: Diagnostic Classification and Prognostic Prediction Using Neuroimaging Data.pdf [Dataset]. http://doi.org/10.3389/fnagi.2019.00220.s001
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fnagi.2019.00220.s001
Dataset updated
May 30, 2023
Dataset provided by
Frontiers
Authors
Taeho Jo; Kwangsik Nho; Andrew J. Saykin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Deep learning, a state-of-the-art machine learning approach, has shown outstanding performance over traditional machine learning in identifying intricate structures in complex high-dimensional data, especially in the domain of computer vision. The application of deep learning to early detection and automated classification of Alzheimer's disease (AD) has recently gained considerable attention, as rapid progress in neuroimaging techniques has generated large-scale multimodal neuroimaging data. A systematic review of publications using deep learning approaches and neuroimaging data for diagnostic classification of AD was performed. A PubMed and Google Scholar search was used to identify deep learning papers on AD published between January 2013 and July 2018. These papers were reviewed, evaluated, and classified by algorithm and neuroimaging type, and the findings were summarized. Of 16 studies meeting full inclusion criteria, 4 used a combination of deep learning and traditional machine learning approaches, and 12 used only deep learning approaches. The combination of traditional machine learning for classification and stacked auto-encoder (SAE) for feature selection produced accuracies of up to 98.8% for AD classification and 83.7% for prediction of conversion from mild cognitive impairment (MCI), a prodromal stage of AD, to AD. Deep learning approaches, such as convolutional neural network (CNN) or recurrent neural network (RNN), that use neuroimaging data without pre-processing for feature selection have yielded accuracies of up to 96.0% for AD classification and 84.2% for MCI conversion prediction. The best classification performance was obtained when multimodal neuroimaging and fluid biomarkers were combined. Deep learning approaches continue to improve in performance and appear to hold promise for diagnostic classification of AD using multimodal neuroimaging data. 
AD research that uses deep learning is still evolving, improving performance by incorporating additional hybrid data types, such as -omics data, and increasing transparency with explainable approaches that add knowledge of specific disease-related features and mechanisms.
Sentinel-2 machine learning dataset for tree species classification in...
openagrar.de
Updated Mar 5, 2024
Cite
Maximilian Freudenberg; Sebastian Schnell; Paul Magdon (2024). Sentinel-2 machine learning dataset for tree species classification in Germany [Dataset]. http://doi.org/10.3220/DATA20240402122351-0
Explore at:
Unique identifier
https://doi.org/10.3220/DATA20240402122351-0
Dataset updated
Mar 5, 2024
Dataset provided by
Thünen Institute of Forest Ecosystems
University of Göttingen
University of Applied Sciences and Arts - HAWK, Göttingen
Authors
Maximilian Freudenberg; Sebastian Schnell; Paul Magdon
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Germany
Description
The dataset contains time series of bottom of atmosphere (BOA) reflectance from the Sentinel-2 satellite mission for tree species classification in a machine learning context. BOA reflectance was computed with the FORCE processing engine (https://force-eo.readthedocs.io/en/latest/index.html) and the corresponding data cube is available at the CODE-DE (https://code-de.org/de/) or EO Lab (https://eo-lab.org/de/) platform. Alternatively, the BOA reflectance can be calculated using the provided FORCE parameter files (*.prm), guaranteeing that BOA values match the ones from the dataset. The time series were extracted from the FORCE data cube for individual tree positions as collected in the field by the German national forest inventory (NFI). A detailed description of the NFI methodology is available here: https://bwi.info/Download/de/Methodik/. The satellite observations span July 2015 to October 2022, and BOA reflectance is labelled with tree species, stem diameter measured at a height of 1.3 m, tree height, crown area as projected to the ground, and additional variables. The dataset contains about 83 million data points from about 360,000 trees covering all environmental conditions in Germany. As reference for geolocation, the centre of the closest 1 km cell of the INSPIRE grid to the corresponding sampling unit of the NFI was used. The exact locations of the sampling units and individual tree positions are confidential. A short introduction to data access and analysis is provided in the Jupyter notebook (intro_to_dataset.ipynb) using Python. A description of the variables is provided below (Methodology) and in the database (table meta_col) along with a code table for the tree species (x_species). For a more detailed description of the dataset, the applied methodology, and a discussion of error sources, please refer to the linked data publication paper. EPSG: 4326
Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...
zenodo.org
csv
Updated Sep 15, 2023
+ more versions
Cite
Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.6607065
Dataset updated
Sep 15, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Anonymous authors
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is in .csv format.

Each competition has the text description and metadata, reflecting competition and used dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using Kaggle API and data sources.

The code blocks and their metadata are collected into data frames according to the publishing year of the initial kernels. The current version of the corpus includes two code block files: snippets from kernels up to 2020 (сode_blocks_upto_20.csv) and those from 2021 (сode_blocks_21.csv), with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.

Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).

As marked up code blocks data contains the numeric id of the code block semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).

The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
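
The mapping from a numeric semantic-type id to a semantic type and subclass, described above, amounts to a lookup join between two tables. The following is a minimal sketch of that join using only the standard library. The in-memory CSV text stands in for markup_data_20220415.csv and actual_graph_2022-06-01.csv, and every column name and value shown is a hypothetical placeholder, not the published schema.

```python
import csv
import io

# Hypothetical stand-ins for the two files; column names and rows are assumptions.
MARKUP_CSV = """snippet_id,semantic_type_id,has_errors,relevance,notebook_id,competition
1,10,0,5,1001,example-competition-a
2,11,1,3,1001,example-competition-a
3,10,0,4,1002,example-competition-b
"""

GRAPH_CSV = """semantic_type_id,semantic_type,subclass
10,Data_Transform,normalization
11,Visualization,heatmap
"""


def load_rows(text: str) -> list[dict[str, str]]:
    """Parse CSV text with a header row into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def attach_semantic_types(snippets, mapping):
    """Add the semantic type and subclass to each snippet via its numeric id."""
    by_id = {row["semantic_type_id"]: row for row in mapping}
    labeled = []
    for snippet in snippets:
        semantic = by_id.get(snippet["semantic_type_id"])
        if semantic is None:
            continue  # skip snippets whose id is missing from the mapping
        labeled.append({
            **snippet,
            "semantic_type": semantic["semantic_type"],
            "subclass": semantic["subclass"],
        })
    return labeled


labeled = attach_semantic_types(load_rows(MARKUP_CSV), load_rows(GRAPH_CSV))
for row in labeled:
    print(row["snippet_id"], row["semantic_type"], row["subclass"])
```

For the real files, the same pattern would apply after replacing the placeholder column names with those found in the downloaded CSV headers.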
Clothes Dataset
kaggle.com
Updated Dec 20, 2024
Cite
RyanBadai (2024). Clothes Dataset [Dataset]. https://www.kaggle.com/datasets/ryanbadai/clothes-dataset/discussion?sort=undefined
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 20, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
RyanBadai
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset contains images of clothing items scraped from Carousell, an online marketplace, specifically curated for image classification tasks. It includes a diverse set of classes representing different types of clothing, making it an excellent resource for machine learning and computer vision projects. The dataset is organized into the following 15 classes:
- Blazer
- Celana_Panjang (Long Pants)
- Celana_Pendek (Shorts)
- Gaun (Dresses)
- Hoodie
- Jaket (Jacket)
- Jaket_Denim (Denim Jacket)
- Jaket_Olahraga (Sports Jacket)
- Jeans
- Kaos (T-shirt)
- Kemeja (Shirt)
- Mantel (Coat)
- Polo
- Rok (Skirt)
- Sweter (Sweater)

The images in this dataset represent various styles, textures, and colors, offering a comprehensive resource for training models to recognize and classify clothing categories. It is ideal for tasks such as building fashion recommendation systems, creating virtual try-on applications, or studying visual trends in fashion e-commerce. Whether you are an enthusiast or a professional, this dataset can help explore and experiment with deep learning techniques in the realm of fashion.
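
For classification work, the 15 class names listed in this description can be turned into a label table. The sketch below keeps the Indonesian class names as keys and the English glosses from the description as values. Two things are assumptions here: that the image folders are named exactly like the classes, and that integer labels are assigned in alphabetical order, which is only one common convention.

```python
# Class names and English glosses as given in the dataset description.
# Classes listed without a gloss (Blazer, Hoodie, Jeans, Polo) map to themselves.
CLASS_TRANSLATIONS = {
    "Blazer": "Blazer",
    "Celana_Panjang": "Long Pants",
    "Celana_Pendek": "Shorts",
    "Gaun": "Dresses",
    "Hoodie": "Hoodie",
    "Jaket": "Jacket",
    "Jaket_Denim": "Denim Jacket",
    "Jaket_Olahraga": "Sports Jacket",
    "Jeans": "Jeans",
    "Kaos": "T-shirt",
    "Kemeja": "Shirt",
    "Mantel": "Coat",
    "Polo": "Polo",
    "Rok": "Skirt",
    "Sweter": "Sweater",
}

# Assumption: integer labels follow alphabetical order of the class names.
CLASS_TO_INDEX = {name: i for i, name in enumerate(sorted(CLASS_TRANSLATIONS))}


def describe(class_name: str) -> str:
    """Return a readable label such as '9: Kaos (T-shirt)' for a class name."""
    index = CLASS_TO_INDEX[class_name]
    return f"{index}: {class_name} ({CLASS_TRANSLATIONS[class_name]})"


print(describe("Kaos"))
```

Keeping the original names as keys means predictions can be reported in either language without renaming any files.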
US Deep Learning Market Analysis, Size, and Forecast 2025-2029
technavio.com
pdf
Updated Jul 8, 2025
Cite
Technavio (2025). US Deep Learning Market Analysis, Size, and Forecast 2025-2029 [Dataset]. https://www.technavio.com/report/us-deep-learning-market-industry-analysis
Explore at:
pdfAvailable download formats
Dataset updated
Jul 8, 2025
Dataset provided by
TechNavio
Authors
Technavio
Time period covered
2025 - 2029
Description
US Deep Learning Market Size 2025-2029

The deep learning market size in US is forecast to increase by USD 5.02 billion at a CAGR of 30.1% between 2024 and 2029.

The deep learning market is experiencing robust growth, driven by the increasing adoption of artificial intelligence (AI) in various industries for advanced solutioning. This trend is fueled by the availability of vast amounts of data, which is a key requirement for deep learning algorithms to function effectively. Industry-specific solutions are gaining traction, as businesses seek to leverage deep learning for specific use cases such as image and speech recognition, fraud detection, and predictive maintenance. Alongside, intuitive data visualization tools are simplifying complex neural network outputs, helping stakeholders understand and validate insights. However, challenges remain, including the need for powerful computing resources, data privacy concerns, and the high cost of implementing and maintaining deep learning systems. Despite these hurdles, the market's potential for innovation and disruption is immense, making it an exciting space for businesses to explore further. Semi-supervised learning, data labeling, and data cleaning facilitate efficient training of deep learning models. Cloud analytics is another significant trend, as companies seek to leverage cloud computing for cost savings and scalability.

What will be the Size of the market During the Forecast Period?

Deep learning, a subset of machine learning, continues to shape industries by enabling advanced applications such as image and speech recognition, text generation, and pattern recognition. Reinforcement learning, a type of deep learning, gains traction, with deep reinforcement learning leading the charge. Anomaly detection, a crucial application of unsupervised learning, safeguards systems against security vulnerabilities. Ethical implications and fairness considerations are increasingly important in deep learning, with emphasis on explainable AI and model interpretability. Graph neural networks and attention mechanisms enhance data preprocessing for sequential data modeling and object detection. Time series forecasting and dataset creation further expand deep learning's reach, while privacy preservation and bias mitigation ensure responsible use.

In summary, deep learning's market dynamics reflect a constant pursuit of innovation, efficiency, and ethical considerations. The Deep Learning Market in the US is flourishing as organizations embrace intelligent systems powered by supervised learning and emerging self-supervised learning techniques. These methods refine predictive capabilities and reduce reliance on labeled data, boosting scalability. BFSI firms utilize AI image recognition for various applications, including personalizing customer communication, maintaining a competitive edge, and automating repetitive tasks to boost productivity. Sophisticated feature extraction algorithms now enable models to isolate patterns with high precision, particularly in applications such as image classification for healthcare, security, and retail.

How is this market segmented and which is the largest segment?

The market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

Application: Image recognition, Voice recognition, Video surveillance and diagnostics, Data mining
Type: Software, Services, Hardware
End-user: Security, Automotive, Healthcare, Retail and commerce, Others
Geography: North America (US)

By Application Insights

The Image recognition segment is estimated to witness significant growth during the forecast period. In the realm of artificial intelligence (AI) and machine learning, image recognition, a subset of computer vision, is gaining significant traction. This technology utilizes neural networks, deep learning models, and various machine learning algorithms to decipher visual data from images and videos. Image recognition is instrumental in numerous applications, including visual search, product recommendations, and inventory management. Consumers can take photographs of products to discover similar items, enhancing the online shopping experience. In the automotive sector, image recognition is indispensable for advanced driver assistance systems (ADAS) and autonomous vehicles, enabling the identification of pedestrians, other vehicles, road signs, and lane markings.

Furthermore, image recognition plays a pivotal role in augmented reality (AR) and virtual reality (VR) applications, where it tracks physical objects and overlays digital content onto real-world scenarios. The model training process involves the backpropagation algorithm, which calculates the loss fu
Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...
zenodo.org
data.europa.eu
zip
Updated Aug 24, 2021
+ more versions
Cite
Amir M. Mir; Evaldas Latoskinas; Georgios Gousios (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://doi.org/10.5281/zenodo.4571228
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.4571228
Dataset updated
Aug 24, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Amir M. Mir; Evaldas Latoskinas; Georgios Gousios
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset was gathered on Sep. 17, 2020. It has more than 5.4K Python repositories hosted on GitHub. See the file ManyTypes4PyDataset.spec for the repository URLs and their commit SHAs.

The dataset is also de-duplicated using the CD4Py tool. The list of duplicate files is provided in duplicate_files.txt file.

All of its Python projects are processed in JSON-formatted files. They contain a seq2seq representation of each file, type-related hints, and information for machine learning models. The structure of JSON-formatted files is described in JSONOutput.md file.

The dataset is split into train, validation and test sets by source code files. The list of files and their corresponding set is provided in dataset_split.csv file.

Notable changes to each version of the dataset are documented in CHANGELOG.md.
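
Because the train, validation, and test assignment is recorded per source file in dataset_split.csv, a small helper can group file paths by set. This is a sketch only: it assumes each row holds a set name followed by a file path with no header row, and the sample rows and paths are invented for illustration. The actual layout should be checked against the downloaded file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt; the two-column, header-less layout is an assumption.
SPLIT_CSV = """train,repos/example/project_a/module.py
valid,repos/example/project_a/utils.py
test,repos/example/project_b/main.py
train,repos/example/project_b/helpers.py
"""


def group_files_by_split(handle) -> dict[str, list[str]]:
    """Read (set name, file path) rows and collect the paths for each set."""
    groups = defaultdict(list)
    for split_name, file_path in csv.reader(handle):
        groups[split_name].append(file_path)
    return dict(groups)


splits = group_files_by_split(io.StringIO(SPLIT_CSV))
for name, files in sorted(splits.items()):
    print(name, len(files))
```

Passing an open file handle instead of `io.StringIO` would apply the same grouping to the real split file.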
Multi Cancer Dataset
kaggle.com
Updated Oct 3, 2024
Cite
Obuli Sai Naren (2024). Multi Cancer Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/9537604
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/9537604
Dataset updated
Oct 3, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Obuli Sai Naren
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
🩺 Multi Cancer Dataset - 8 Types of Cancer Images

Overview

This dataset contains images of various cancer types, compiled for research and analysis purposes. It includes 8 main cancer classes and 26 subclasses, providing a rich resource for medical image classification and machine learning applications.

📝 Citation

If you use this dataset in your research or project, please make sure to cite it appropriately. Thanks! ❤️ You can check DOI Citation section at the bottom.

APA

Obuli Sai Naren. (2022). Multi Cancer Dataset [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/3415848

📊 Dataset Details

| Cancer | Classes | Images |
| --- | --- | --- |
| Acute Lymphoblastic Leukemia | 4 | 20,000 |
| Brain Cancer | 3 | 15,000 |
| Breast Cancer | 2 | 10,000 |
| Cervical Cancer | 5 | 25,000 |
| Kidney Cancer | 2 | 10,000 |
| Lung and Colon Cancer | 5 | 25,000 |
| Lymphoma | 3 | 15,000 |
| Oral Cancer | 2 | 10,000 |

Total Images: 130,000
Format: JPEG
Dimensions: 512px × 512px
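
The counts in the dataset details can be cross-checked with a few lines of arithmetic. This sketch takes the per-cancer subclass counts from the table and the 5,000-images-per-subclass figure stated in the description, then recomputes the totals. The variable names are illustrative.

```python
# Subclass counts per cancer type, as listed in the dataset details.
SUBCLASSES_PER_CANCER = {
    "Acute Lymphoblastic Leukemia": 4,
    "Brain Cancer": 3,
    "Breast Cancer": 2,
    "Cervical Cancer": 5,
    "Kidney Cancer": 2,
    "Lung and Colon Cancer": 5,
    "Lymphoma": 3,
    "Oral Cancer": 2,
}
IMAGES_PER_SUBCLASS = 5_000  # stated in the description for every subclass folder

# Recompute the per-cancer image counts and the overall totals.
images_per_cancer = {
    cancer: count * IMAGES_PER_SUBCLASS
    for cancer, count in SUBCLASSES_PER_CANCER.items()
}
total_subclasses = sum(SUBCLASSES_PER_CANCER.values())
total_images = sum(images_per_cancer.values())

print(f"{len(SUBCLASSES_PER_CANCER)} cancer types, {total_subclasses} subclasses, "
      f"{total_images:,} images")
```

The recomputed totals agree with the 8 main classes, 26 subclasses, and 130,000 images quoted in the description.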

📂 Folder Structure & Class Names

Each subclass folder contains 5,000 images. The datasets referenced for each cancer type are linked below.

📄 Notes on Images

All subclass folders contain 5,000 images each.

Each image follows the naming format <subclass>_<serial_number>.jpg for easy reference.
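
The naming format above can be parsed back into a subclass label and a serial number. The sketch below splits on the last underscore, on the assumption that subclass names may themselves contain underscores. The example file name is hypothetical and `parse_image_name` is an illustrative helper, not part of the dataset.

```python
from pathlib import PurePosixPath


def parse_image_name(file_name: str) -> tuple[str, int]:
    """Split '<subclass>_<serial_number>.jpg' into (subclass, serial number)."""
    stem = PurePosixPath(file_name).stem
    # Split on the LAST underscore so that subclass names may contain underscores.
    subclass, _, serial = stem.rpartition("_")
    if not subclass or not serial.isdigit():
        raise ValueError(f"Unexpected file name: {file_name!r}")
    return subclass, int(serial)


# Hypothetical file name, used only to illustrate the pattern.
print(parse_image_name("brain_glioma_0042.jpg"))
```

Names that do not match the pattern raise `ValueError`, which makes stray files in a folder easy to spot.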

For more detailed information on the dataset structure, preprocessing, and usage, please refer to the README.md file included in the dataset's main directory.

Feel free to download, analyze, and contribute! 📊💻
Network Traffic Dataset
kaggle.com
Updated Oct 31, 2023
Cite
Ravikumar Gattu (2023). Network Traffic Dataset [Dataset]. https://www.kaggle.com/datasets/ravikumargattu/network-traffic-dataset
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 31, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ravikumar Gattu
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The data presented here was obtained on a Kali machine at the University of Cincinnati, Cincinnati, Ohio, by carrying out packet captures for 1 hour during the evening of Oct 9, 2023, using Wireshark. The dataset consists of 394137 instances stored in a CSV (Comma Separated Values) file. This large dataset could be used for different machine learning applications, for instance classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.

The dataset can be used for a variety of machine learning tasks, such as network intrusion detection, traffic classification, and anomaly detection.

Content :

This network traffic dataset consists of 7 features. Each instance contains the source and destination IP addresses. The majority of the properties are numeric in nature; however, there are also nominal and date types due to the Timestamp.

The network traffic flow statistics (No. Time Source Destination Protocol Length Info) were obtained using Wireshark (https://www.wireshark.org/).

Dataset Columns:

No: Number of the instance.

Timestamp: Timestamp of the network traffic instance.

Source IP: IP address of the source.

Destination IP: IP address of the destination.

Protocol: Protocol used by the instance.

Length: Length of the instance.

Info: Information about the traffic instance.
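
A first look at a capture with these columns might count packets and bytes per protocol. The sketch below uses only the Python standard library and assumes the header names from the Wireshark listing above (No., Time, Source, Destination, Protocol, Length, Info); the header in the downloaded CSV may differ. The three sample rows are made up for illustration and use documentation-only IP addresses, so they are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Made-up rows in the Wireshark column layout. Addresses come from the
# documentation ranges (RFC 5737); none of this is data from the dataset.
SAMPLE = '''"No.","Time","Source","Destination","Protocol","Length","Info"
"1","0.000000","192.0.2.10","198.51.100.5","TCP","66","placeholder"
"2","0.000120","198.51.100.5","192.0.2.10","TCP","54","placeholder"
"3","0.004300","192.0.2.10","203.0.113.7","DNS","74","placeholder"
'''


def summarize(text):
    """Return (packets per protocol, bytes per protocol) for CSV text."""
    packets = Counter()
    total_bytes = Counter()
    for row in csv.DictReader(io.StringIO(text)):
        packets[row["Protocol"]] += 1
        total_bytes[row["Protocol"]] += int(row["Length"])
    return packets, total_bytes


packets, total_bytes = summarize(SAMPLE)
print(packets.most_common())
print(dict(total_bytes))
```

For the real file, the same function could be fed the text read from the downloaded CSV instead of SAMPLE.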

Acknowledgements :

I would like to thank the University of Cincinnati for providing the infrastructure used to generate this network traffic dataset.

Ravikumar Gattu , Susmitha Choppadandi

Inspiration: This dataset goes beyond the majority of network traffic classification datasets, which only identify the type of application (WWW, DNS, ICMP, ARP, RARP) that an IP flow contains. Instead, it is intended for building machine learning models that can identify specific applications (such as TikTok, Wikipedia, Instagram, YouTube, websites, blogs, etc.) from IP flow statistics (there are currently 25 applications in total).

Dataset License: CC0: Public Domain

Dataset Usages: This dataset can be used for different machine learning applications in the field of cybersecurity, such as classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.

ML techniques that benefit from this dataset:

This dataset is highly useful because it consists of 394137 instances of network traffic data obtained by using the 25 applications on public, private, and enterprise networks. The dataset also contains important features that can be used for most applications of machine learning in cybersecurity. A few of the potential machine learning applications that could benefit from this dataset are:

Network Performance Monitoring: This large network traffic dataset can be used to analyse network traffic and identify patterns in the network. This helps in designing network security algorithms that minimise network problems.

Anomaly Detection: A large network traffic dataset can be used to train machine learning models to find irregularities in the traffic, which could help identify cyber attacks.

Network Intrusion Detection: This large dataset could be used to train machine learning algorithms and design models for detecting traffic issues, malicious traffic, network attacks, and DoS attacks.
50 Types of Car Parts -Image Classification
gts.ai
kaggle.com
json
Updated Mar 20, 2024
+ more versions
Cite
GTS (2024). 50 Types of Car Parts -Image Classification [Dataset]. https://gts.ai/dataset-download/50-types-of-car-parts-image-classification/
Explore at:
json (available download format)
Dataset updated
Mar 20, 2024
Dataset provided by
GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
Authors
GTS
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
This is a dataset of images of 50 types of car parts. It includes a train set, a test set and a validation set. There are 50 classes of car parts...
Machine learning model that estimates public-supply deliveries for domestic...
catalog.data.gov
data.usgs.gov
+2 more
Updated Aug 29, 2024
Cite
U.S. Geological Survey (2024). Machine learning model that estimates public-supply deliveries for domestic and other use types [Dataset]. https://catalog.data.gov/dataset/machine-learning-model-that-estimates-public-supply-deliveries-for-domestic-and-other-use-
Explore at:
Dataset updated
Aug 29, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Description
This child item describes a public-supply delivery machine learning model that was developed to estimate public-supply deliveries. Publicly supplied water may be delivered to domestic users or to commercial, industrial, institutional, and irrigation (CII) users. This model predicts total, domestic, and CII per capita rates for public-supply water service areas within the conterminous United States for 2009-2020. This child item contains model input datasets, code used to build the delivery machine learning model, and national predictions. This dataset is part of a larger data release using machine learning to predict public-supply water use for 12-digit hydrologic units from 2000-2020. This page includes the following file: delivery_water_use_model.zip - a zip file containing input datasets, scripts, and output datasets for the delivery water use machine learning model
Iris Species Dataset and Database
kaggle.com
Updated May 15, 2025
Cite
Ghanshyam Saini (2025). Iris Species Dataset and Database [Dataset]. https://www.kaggle.com/datasets/ghnshymsaini/iris-species-dataset-and-database
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 15, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ghanshyam Saini
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Iris Flower Dataset

This is a classic and very widely used dataset in machine learning and statistics, often serving as a first dataset for classification problems. Introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems," it is a foundational resource for learning classification algorithms.

Overview:

The dataset contains measurements for 150 samples of iris flowers. Each sample belongs to one of three species of iris:

Iris setosa

Iris versicolor

Iris virginica

For each flower, four features were measured:

Sepal length (in cm)

Sepal width (in cm)

Petal length (in cm)

Petal width (in cm)

The goal is typically to build a model that can classify iris flowers into their correct species based on these four features.

File Structure:

The dataset is usually provided as a single CSV (Comma Separated Values) file, often named iris.csv or similar. This file typically contains the following columns:

sepal_length (cm): Numerical. The length of the sepal of the iris flower.

sepal_width (cm): Numerical. The width of the sepal of the iris flower.

petal_length (cm): Numerical. The length of the petal of the iris flower.

petal_width (cm): Numerical. The width of the petal of the iris flower.

species: Categorical. The species of the iris flower (either 'setosa', 'versicolor', or 'virginica'). This is the target variable for classification.

Content of the Data:

The dataset contains an equal number of samples (50) for each of the three iris species. The measurements of the sepal and petal dimensions vary between the species, allowing for their differentiation using machine learning models.

How to Use This Dataset:

Download the iris.csv file.

Load the data using libraries like Pandas in Python.

Explore the data through visualization and statistical analysis to understand the relationships between the features and the different species.

Build classification models (e.g., Logistic Regression, Support Vector Machines, Decision Trees, K-Nearest Neighbors) using the sepal and petal measurements as features and the 'species' column as the target variable.

Evaluate the performance of your model using appropriate metrics (e.g., accuracy, precision, recall, F1-score).
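
The steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a recommended pipeline: it inlines six rows of the classic Iris data so that it runs without the download, assumes the header names shown (check them against the real iris.csv), and swaps in a 1-nearest-neighbour rule as the simplest stand-in for the classifiers listed above.

```python
import csv
import io
import math

# Six rows of the classic Iris data, inlined so the sketch runs without
# the download. The header names are an assumption about iris.csv.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""

FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def load_rows(text):
    """Parse CSV text into (feature_vector, species) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [([float(r[f]) for f in FEATURES], r["species"]) for r in reader]


def predict_1nn(train, x):
    """Return the species of the training sample closest to x (Euclidean)."""
    return min(train, key=lambda item: math.dist(item[0], x))[1]


def accuracy(train, test):
    """Fraction of test samples whose predicted species matches the label."""
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)


rows = load_rows(SAMPLE_CSV)
# Alternate rows between train and test so every species appears in both.
train, test = rows[0::2], rows[1::2]
print(accuracy(train, test))
```

With the full 150-row file, a proper shuffled split and a library such as Pandas or scikit-learn would be the more usual choice; the sketch only shows the load, split, predict, and evaluate flow.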

The dataset is small and well-behaved, making it excellent for learning and experimenting with various classification techniques.

Citation:

When using the Iris dataset, it is common to cite Ronald Fisher's original work:

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188.

Data Contribution:

Thank you for providing this classic and fundamental dataset to the Kaggle community. The Iris dataset remains an invaluable resource for both beginners learning the basics of classification and experienced practitioners testing new algorithms. Its simplicity and clear class separation make it an ideal starting point for many data science projects.

If you find this dataset description helpful and the dataset itself useful for your learning or projects, please consider giving it an upvote after downloading. Your appreciation is valuable!
Data from: MLOmics: Cancer Multi-Omics Database for Machine Learning
figshare.com
bin
Updated May 25, 2025
+ more versions
Cite
Rikuto Kotoge (2025). MLOmics: Cancer Multi-Omics Database for Machine Learning [Dataset]. http://doi.org/10.6084/m9.figshare.28729127.v2
Explore at:
bin (available download format)
Unique identifier
https://doi.org/10.6084/m9.figshare.28729127.v2
Dataset updated
May 25, 2025
Dataset provided by
figshare
Authors
Rikuto Kotoge
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Framing the investigation of diverse cancers as a machine learning problem has recently shown significant potential in multi-omics analysis and cancer research. Empowering these successful machine learning models are high-quality training datasets with sufficient data volume and adequate preprocessing. However, while there exist several public data portals, including The Cancer Genome Atlas (TCGA) multi-omics initiative, and open databases such as LinkedOmics, these databases are not off-the-shelf for existing machine learning models. We propose MLOmics, an open cancer multi-omics database that aims to better serve the development and evaluation of bioinformatics and machine learning models. MLOmics contains 8,314 patient samples covering all 32 cancer types with four omics types, stratified features, and extensive baselines. Complementary support for downstream analysis and bio-knowledge linking is also included to support interdisciplinary analysis.
Encrypted Traffic Feature Dataset for Machine Learning and Deep Learning...
data.mendeley.com
Updated Dec 6, 2022
+ more versions
Cite
Zihao Wang (2022). Encrypted Traffic Feature Dataset for Machine Learning and Deep Learning based Encrypted Traffic Analysis [Dataset]. http://doi.org/10.17632/xw7r4tt54g.1
Explore at:
Unique identifier
https://doi.org/10.17632/xw7r4tt54g.1
Dataset updated
Dec 6, 2022
Authors
Zihao Wang
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This traffic dataset contains a balanced mix of encrypted malicious and legitimate traffic for encrypted malicious traffic detection and analysis. It is a secondary CSV feature dataset composed from six public traffic datasets.

Our dataset is curated based on two criteria. The first criterion is to combine widely used public datasets that contain enough encrypted malicious or encrypted legitimate traffic in existing works, such as the Malware Capture Facility Project datasets. The second criterion is to ensure that the final dataset is balanced between encrypted malicious and legitimate network traffic.

Based on these criteria, 6 public datasets were selected. After data pre-processing, details of each selected public dataset and the size of the different encrypted traffic types are shown in the “Dataset Statistic Analysis Document”. The document summarizes the malicious and legitimate traffic size selected from each public dataset, the traffic size of each malicious traffic type, and the total traffic size of the composed dataset. From the table, we can observe that encrypted malicious and legitimate traffic each contribute approximately 50% of the final composed dataset.

The datasets made available here were prepared for encrypted malicious traffic detection. Since the dataset is intended for machine learning or deep learning model training, sample train and test sets are also provided. The train and test datasets are split in a 1:4 ratio. These datasets can be used for machine learning or deep learning model training and testing, either on the selected features or after further data pre-processing.
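
A ratio-based split like the 1:4 one mentioned above could be reproduced roughly as follows. The description does not say which side of the ratio is the larger set, so this sketch assumes one part test to four parts train (20% test); the function name and parameters are illustrative and not part of the dataset's tooling.

```python
import random


def split_by_ratio(rows, test_parts=1, train_parts=4, seed=0):
    """Shuffle rows and split them test:train = test_parts:train_parts.

    Which side of the 1:4 ratio is the test set is an assumption here.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for a repeatable split
    n_test = len(rows) * test_parts // (test_parts + train_parts)
    return rows[n_test:], rows[:n_test]


train, test = split_by_ratio(range(100))
print(len(train), len(test))
```

Because the dataset is described as balanced between malicious and legitimate traffic, a split that shuffles within each class separately may be preferable in practice; the plain shuffle here keeps the sketch short.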
Intrusion_detection_dataset
kaggle.com
Updated Jun 23, 2023
Cite
Hamza Farooq (2023). Intrusion_detection_dataset [Dataset]. https://www.kaggle.com/datasets/ameerhamza123/intrusion-detection-dataset/code
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jun 23, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Hamza Farooq
Description
This dataset contains network traffic data collected from a computer network. The network consists of various devices, such as computers, servers, and routers, interconnected to facilitate communication and data exchange. The dataset captures different types of network activity, including normal network traffic as well as various network anomalies and attacks. It provides a comprehensive view of network behavior and can be used for studying network security, intrusion detection, and anomaly detection algorithms. The dataset includes features such as source and destination IP addresses, port numbers, protocol types, packet sizes, and timestamps, enabling detailed analysis of network traffic patterns and characteristics.

The second file in this dataset contains synthetic data generated using a Generative Adversarial Network (GAN). GANs are a type of deep learning model that can learn the underlying patterns and distributions of a given dataset and generate new synthetic samples that resemble the original data. In this case, the GAN was trained on the network traffic data from the first file to learn the characteristics and structure of the network traffic. The synthetic data in the second file aims to mimic the patterns and behavior observed in real network traffic. It can be used for various purposes, such as augmenting the original dataset, testing the robustness of machine learning models, or exploring different scenarios in network analysis.
Flagship Dataset of Type 2 Diabetes from the AI-READI Project
fairhub.io
Updated May 3, 2024
+ more versions
Cite
AI-READI Consortium (2024). Flagship Dataset of Type 2 Diabetes from the AI-READI Project [Dataset]. https://fairhub.io/datasets/1
Explore at:
Dataset updated
May 3, 2024
Dataset provided by
FAIRhub
Authors
AI-READI Consortium
Dataset funded by
National Institutes of Health
Description
This dataset contains data from 204 participants from the pilot period of the AI-READI project (July 19, 2023 to November 30, 2023). Data from multiple modalities are included. The data in this dataset contain no protected health information (PHI). Information related to the sex and race/ethnicity of the participants, as well as medication used, has also been removed. A detailed description of the dataset is available in the AI-READI documentation for v1.0.0 of the dataset at https://docs.aireadi.org

Cancer	Classes	Images
Acute Lymphoblastic Leukemia	4	20,000
Brain Cancer	3	15,000
Breast Cancer	2	10,000
Cervical Cancer	5	25,000
Kidney Cancer	2	10,000
Lung and Colon Cancer	5	25,000
Lymphoma	3	15,000
Oral Cancer	2	10,000

ML Datasets

This Kaggle dataset contains different kinds of Machine Learning Datasets.