100+ datasets found

Machine Learning Dataset
brightdata.com
.json, .csv, .xlsx
Updated Dec 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
Explore at:
.json, .csv, .xlsxAvailable download formats
Dataset updated
Dec 23, 2024
Dataset authored and provided by
Bright Datahttps://brightdata.com/
License
https://brightdata.com/licensehttps://brightdata.com/license
Area covered
Worldwide
Description
Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.
d
A Dataset for Machine Learning Algorithm Development
catalog.data.gov
fisheries.noaa.gov
Updated May 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(Point of Contact, Custodian) (2024). A Dataset for Machine Learning Algorithm Development [Dataset]. https://catalog.data.gov/dataset/a-dataset-for-machine-learning-algorithm-development2
Explore at:
Dataset updated
May 1, 2024
Dataset provided by
(Point of Contact, Custodian)
Description
This dataset consists of imagery, imagery footprints, associated ice seal detections and homography files associated with the KAMERA Test Flights conducted in 2019. This dataset was subset to include relevant data for detection algorithm development. This dataset is limited to data collected during flights 4, 5, 6 and 7 from our 2019 surveys.
D
SYNERGY - Open machine learning dataset on study selection in systematic...
dataverse.nl
csv, json, txt, zip
Updated Apr 24, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jonathan De Bruin; Jonathan De Bruin; Yongchao Ma; Yongchao Ma; Gerbrich Ferdinands; Gerbrich Ferdinands; Jelle Teijema; Jelle Teijema; Rens Van de Schoot; Rens Van de Schoot (2023). SYNERGY - Open machine learning dataset on study selection in systematic reviews [Dataset]. http://doi.org/10.34894/HE6NAQ
Explore at:
txt(212), json(702), zip(16028323), json(19426), txt(263), zip(3560967), txt(305), json(470), txt(279), zip(2355371), json(23201), csv(460956), txt(200), json(685), json(546), csv(63996), zip(2989015), zip(5749455), txt(331), txt(315), json(691), json(23775), csv(672721), json(468), txt(415), json(22778), csv(31919), csv(746832), json(18392), zip(62992826), csv(234822), txt(283), zip(34788857), json(475), txt(242), json(533), csv(42227), json(24548), zip(738232), json(22477), json(25491), zip(11463283), json(17741), csv(490660), json(19662), json(578), csv(19786), zip(14708207), zip(24619707), zip(2404439), json(713), json(27224), json(679), json(26426), txt(185), json(906), zip(18534723), json(23550), txt(266), txt(317), zip(6019723), json(33943), txt(436), csv(388378), json(469), zip(2106498), txt(320), csv(451336), txt(338), zip(19428163), json(14326), json(31652), txt(299), csv(96153), txt(220), csv(114789), json(15452), csv(5372708), json(908), csv(317928), csv(150923), json(465), csv(535584), json(26090), zip(8164831), json(19633), txt(316), json(23494), csv(133950), json(18638), csv(3944082), json(15345), json(473), zip(4411063), zip(10396095), zip(835096), txt(255), json(699), csv(654705), txt(294), csv(989865), zip(1028035), txt(322), zip(15085090), txt(237), txt(310), json(756), json(30628), json(19490), json(25908), txt(401), json(701), zip(5543909), json(29397), zip(14007470), json(30058), zip(58869042), csv(852937), json(35711), csv(298011), csv(187163), txt(258), zip(3526740), json(568), json(21552), zip(66466788), csv(215250), json(577), csv(103010), txt(306), zip(11840006)Available download formats
Unique identifier
https://doi.org/10.34894/HE6NAQ
Dataset updated
Apr 24, 2023
Dataset provided by
DataverseNL
Authors
Jonathan De Bruin; Jonathan De Bruin; Yongchao Ma; Yongchao Ma; Gerbrich Ferdinands; Gerbrich Ferdinands; Jelle Teijema; Jelle Teijema; Rens Van de Schoot; Rens Van de Schoot
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
SYNERGY is a free and open dataset on study selection in systematic reviews, comprising 169,288 academic works from 26 systematic reviews. Only 2,834 (1.67%) of the academic works in the binary classified dataset are included in the systematic reviews. This makes the SYNERGY dataset a unique dataset for the development of information retrieval algorithms, especially for sparse labels. Due to the many available variables available per record (i.e. titles, abstracts, authors, references, topics), this dataset is useful for researchers in NLP, machine learning, network analysis, and more. In total, the dataset contains 82,668,134 trainable data points. The easiest way to get the SYNERGY dataset is via the synergy-dataset Python package. See https://github.com/asreview/synergy-dataset for all information.
A
Artificial Intelligence Training Dataset Report
datainsightsmarket.com
doc, pdf, ppt
Updated May 3, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data Insights Market (2025). Artificial Intelligence Training Dataset Report [Dataset]. https://www.datainsightsmarket.com/reports/artificial-intelligence-training-dataset-1958994
Explore at:
doc, ppt, pdfAvailable download formats
Dataset updated
May 3, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The global Artificial Intelligence (AI) Training Dataset market is experiencing robust growth, driven by the increasing adoption of AI across diverse sectors. The market's expansion is fueled by the burgeoning need for high-quality data to train sophisticated AI algorithms capable of powering applications like smart campuses, autonomous vehicles, and personalized healthcare solutions. The demand for diverse dataset types, including image classification, voice recognition, natural language processing, and object detection datasets, is a key factor contributing to market growth. While the exact market size in 2025 is unavailable, considering a conservative estimate of a $10 billion market in 2025 based on the growth trend and reported market sizes of related industries, and a projected CAGR (Compound Annual Growth Rate) of 25%, the market is poised for significant expansion in the coming years. Key players in this space are leveraging technological advancements and strategic partnerships to enhance data quality and expand their service offerings. Furthermore, the increasing availability of cloud-based data annotation and processing tools is further streamlining operations and making AI training datasets more accessible to businesses of all sizes. Growth is expected to be particularly strong in regions with burgeoning technological advancements and substantial digital infrastructure, such as North America and Asia Pacific. However, challenges such as data privacy concerns, the high cost of data annotation, and the scarcity of skilled professionals capable of handling complex datasets remain obstacles to broader market penetration. The ongoing evolution of AI technologies and the expanding applications of AI across multiple sectors will continue to shape the demand for AI training datasets, pushing this market toward higher growth trajectories in the coming years. The diversity of applications—from smart homes and medical diagnoses to advanced robotics and autonomous driving—creates significant opportunities for companies specializing in this market. Maintaining data quality, security, and ethical considerations will be crucial for future market leadership.
AI & ML Popularity Index
kaggle.com
zip
Updated Jun 5, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Muhammad Roshan Riaz (2024). AI & ML Popularity Index [Dataset]. https://www.kaggle.com/datasets/muhammadroshaanriaz/ai-and-ml-popularity-indexanalyzing-global-trends
Explore at:
zip(4198 bytes)Available download formats
Dataset updated
Jun 5, 2024
Authors
Muhammad Roshan Riaz
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Overview This dataset provides comprehensive insights into the global popularity trends of Artificial Intelligence (AI) and Machine Learning (ML). The data has been meticulously gathered and curated to reflect the growing interest and adoption of these technologies across various regions and sectors.

Data Sources The dataset aggregates information from multiple sources, including:

Search engine query data Social media mentions and hashtags Research publication counts Online course enrolments Job postings
Datasets
figshare.com
zip
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bastian Eichenberger; YinXiu Zhan (2023). Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.12958037.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12958037.v1
Dataset updated
May 31, 2023
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Bastian Eichenberger; YinXiu Zhan
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
The benchmarking datasets used for deepBlink. The npz files contain train/valid/test splits inside and can be used directly. The files belong to the following challenges / classes:- ISBI Particle tracking challenge: microtubule, vesicle, receptor- Custom synthetic (based on http://smal.ws): particle- Custom fixed cell: smfish- Custom live cell: suntagThe csv files are to determine which image in the test splits correspond to which original image, SNR, and density.
m
A dataset for machine learning research in the field of stress analyses of...
data.mendeley.com
narcis.nl
Updated Jul 25, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jaroslav Matej (2020). A dataset for machine learning research in the field of stress analyses of mechanical structures [Dataset]. http://doi.org/10.17632/wzbzznk8z3.2
Explore at:
Unique identifier
https://doi.org/10.17632/wzbzznk8z3.2
Dataset updated
Jul 25, 2020
Authors
Jaroslav Matej
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is prepared and intended as a data source for development of a stress analysis method based on machine learning. It consists of finite element stress analyses of randomly generated mechanical structures. The dataset contains more than 270,794 pairs of stress analyses images (von Mises stress) of randomly generated 2D structures with predefined thickness and material properties. All the structures are fixed at their bottom edges and loaded with gravity force only. See PREVIEW directory with some examples. The zip file contains all the files in the dataset.
Airoboros LLMs Math Dataset
kaggle.com
zip
Updated Nov 24, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2023). Airoboros LLMs Math Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/airoboros-llms-math-dataset
Explore at:
zip(36964941 bytes)Available download formats
Dataset updated
Nov 24, 2023
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Airoboros LLMs Math Dataset

Mastering Complex Mathematical Operations in Machine Learning

By Huggingface Hub [source]

About this dataset

The Airoboros-3.1 dataset is the perfect tool to help machine learning models excel in the difficult realm of complicated mathematical operations. This data collection features thousands of conversations between machines and humans, formatted in ShareGPT to maximize optimization in an OS ecosystem. The dataset’s focus on advanced subjects like factorials, trigonometry, and larger numerical values will help drive machine learning models to the next level - facilitating critical acquisition of sophisticated mathematical skills that are essential for ML success. As AI technology advances at such a rapid pace, training neural networks to correspondingly move forward can be a daunting and complicated challenge - but with Airoboros-3.1’s powerful datasets designed around difficult mathematical operations it just became one step closer to achievable!

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

To get started, download the dataset from Kaggle and use the train.csv file. This file contains over two thousand examples of conversations between ML models and humans which have been formatted using ShareGPT - fast and efficient OS ecosystem fine-tuning tools designed to help with understanding mathematical operations more easily. The file includes two columns: category and conversations, both of which are marked as strings in the data itself.

Once you have downloaded the train file you can begin setting up your own ML training environment by using any of your preferred frameworks or methods. Your model should focus on predicting what kind of mathematical operations will likely be involved in future conversations by referring back to previous dialogues within this dataset for reference (category column). You can also create your own test sets from this data, adding new conversation topics either by modifying existing rows or creating new ones entirely with conversation topics related to mathematics. Finally, compare your model’s results against other established models or algorithms that are already published online!

Happy training!

Research Ideas

It can be used to build custom neural networks or machine learning algorithms that are specifically designed for complex mathematical operations.

This data set can be used to teach and debug more general-purpose machine learning models to recognize large numbers, and intricate calculations within natural language processing (NLP).

The Airoboros-3.1 dataset can also be utilized as a supervised learning task: models could learn from the conversations provided in the dataset how to respond correctly when presented with complex mathematical operations

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: train.csv | Column name | Description | |:------------------|:-----------------------------------------------------------------------------| | category | The type of mathematical operation being discussed. (String) | | conversations | The conversations between the machine learning model and the human. (String) |

Acknowledgements

If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.
Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects...
figshare.com
txt
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO (2023). NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python [Dataset]. http://doi.org/10.6084/m9.figshare.21967265.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.21967265.v1
Dataset updated
May 30, 2023
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts in filtering those projects to curate ML projects of high quality. The limited availability of such high-quality dataset poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidences of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide "NICHE.csv" file that contains the list of the project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.

GitHub page: https://github.com/soarsmu/NICHE
Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...
zenodo.org
csv
Updated Sep 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6607065
Dataset updated
Sep 15, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Anonymous authors; Anonymous authors
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

The data is organized in a table structure. Code4ML includes several main objects: competitions information, raw code blocks collected form Kaggle and manually marked up snippets. Each table has a .csv format.

Each competition has the text description and metadata, reflecting competition and used dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using Kaggle API and data sources.

The code blocks themselves and their metadata are collected to the data frames concerning the publishing year of the initial kernels. The current version of the corpus includes two code blocks files: snippets from kernels up to the 2020 year (сode_blocks_upto_20.csv) and those from the 2021 year (сode_blocks_21.csv) with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.

Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).

As marked up code blocks data contains the numeric id of the code block semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).

The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
a
MNIST
datasets.activeloop.ai
deeplake
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yann LeCun, MNIST [Dataset]. https://datasets.activeloop.ai/docs/ml/datasets/mnist/
Explore at:
deeplakeAvailable download formats
Authors
Yann LeCun
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Time period covered
Jan 1, 1998 - Dec 31, 2000
Area covered
Earth
Dataset funded by
AT&T Bell Labs
Description
The MNIST dataset is a dataset of handwritten digits. It is a popular dataset for machine learning and artificial intelligence research. The dataset consists of 60,000 training images and 10,000 test images. Each image is a 28x28 pixel grayscale image of a handwritten digit. The digits are labeled from 0 to 9.
Android malware dataset for machine learning 2
figshare.com
txt
Updated Nov 26, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Suleiman Yerima (2025). Android malware dataset for machine learning 2 [Dataset]. http://doi.org/10.6084/m9.figshare.5854653.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5854653.v1
Dataset updated
Nov 26, 2025
Dataset provided by
Figsharehttp://figshare.com/
Authors
Suleiman Yerima
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset consisting of feature vectors of 215 attributes extracted from 15,036 applications (5,560 malware apps from Drebin project and 9,476 benign apps). The dataset has been used to develop and evaluate multilevel classifier fusion approach for Android malware detection, published in the IEEE Transactions on Cybernetics paper 'DroidFusion: A Novel Multilevel Classifier Fusion Approach for Android Malware Detection'. The supporting file contains further description of the feature vectors/attributes obtained via static code analysis of the Android apps.
m
Encrypted Traffic Feature Dataset for Machine Learning and Deep Learning...
data.mendeley.com
Updated Dec 6, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zihao Wang (2022). Encrypted Traffic Feature Dataset for Machine Learning and Deep Learning based Encrypted Traffic Analysis [Dataset]. http://doi.org/10.17632/xw7r4tt54g.1
Explore at:
Unique identifier
https://doi.org/10.17632/xw7r4tt54g.1
Dataset updated
Dec 6, 2022
Authors
Zihao Wang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This traffic dataset contains a balance size of encrypted malicious and legitimate traffic for encrypted malicious traffic detection and analysis. The dataset is a secondary csv feature data that is composed of six public traffic datasets.

Our dataset is curated based on two criteria: The first criterion is to combine widely considered public datasets which contain enough encrypted malicious or encrypted legitimate traffic in existing works, such as Malware Capture Facility Project datasets. The second criterion is to ensure the final dataset balance of encrypted malicious and legitimate network traffic.

Based on the criteria, 6 public datasets are selected. After data pre-processing, details of each selected public dataset and the size of different encrypted traffic are shown in the “Dataset Statistic Analysis Document”. The document summarized the malicious and legitimate traffic size we selected from each selected public dataset, the traffic size of each malicious traffic type, and the total traffic size of the composed dataset. From the table, we are able to observe that encrypted malicious and legitimate traffic equally contributes to approximately 50% of the final composed dataset.

The datasets now made available were prepared to aim at encrypted malicious traffic detection. Since the dataset is used for machine learning or deep learning model training, a sample of train and test sets are also provided. The train and test datasets are separated based on 1:4. Such datasets can be used for machine learning or deep learning model training and testing based on selected features or after processing further data pre-processing.
Titanic Dataset - Machine Learning from Disaster
kaggle.com
zip
Updated Sep 20, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aman Chauhan (2022). Titanic Dataset - Machine Learning from Disaster [Dataset]. https://www.kaggle.com/datasets/whenamancodes/titanic-dataset-machine-learning-from-disaster
Explore at:
zip(34877 bytes)Available download formats
Dataset updated
Sep 20, 2022
Authors
Aman Chauhan
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Overview

The data has been split into two groups:

training set (train.csv)

test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.

Data Dictionary:

| Variable | Definition | Key | | --- | --- | | survival | Survival | 0 = No, 1 = Yes | | pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd | | sex | Sex | | | Age | Age in years | | | sibsp | # of siblings / spouses aboard the Titanic | | | parch | # of parents / children aboard the Titanic | | | ticket | Ticket number | | | fare | Passenger fare | | | cabin | Cabin number | | | embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

Variable Notes

pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5

sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.

More - Find More Exciting🙀 Datasets Here - An Upvote👍 A Dayᕙ(`▿´)ᕗ , Keeps Aman Hurray Hurray..... ٩(˘◡˘)۶Hehe
Insurance Data for Machine Learning
kaggle.com
zip
Updated Apr 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
SRIDHAR STREAKS (2023). Insurance Data for Machine Learning [Dataset]. https://www.kaggle.com/datasets/sridharstreaks/insurance-data-for-machine-learning
Explore at:
zip(22284586 bytes)Available download formats
Dataset updated
Apr 8, 2023
Authors
SRIDHAR STREAKS
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Insurance Dataset for Predicting Health Insurance Premiums in the US" is a collection of data on various factors that can influence medical costs and premiums for health insurance in the United States. The dataset includes information on 10 variables, including age, gender, body mass index (BMI), number of children, smoking status, region, income, education, occupation, and type of insurance plan. The dataset was created using a script that generated a million records of randomly sampled data points, ensuring that the data represented the population of insured individuals in the US. The dataset can be used to build and test machine learning models for predicting insurance premiums and exploring the relationship between different factors and medical costs.
Z
OpenABC-D: A Large-Scale Dataset For Machine Learning Guided Integrated...
data.niaid.nih.gov
Updated May 13, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Animesh Basak Chowdhury; Benjamin Tan; Ramesh Karri; Siddharth Garg (2022). OpenABC-D: A Large-Scale Dataset For Machine Learning Guided Integrated Circuit Synthesis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6399454
Explore at:
Dataset updated
May 13, 2022
Dataset provided by
New York University
University of Calgary
Authors
Animesh Basak Chowdhury; Benjamin Tan; Ramesh Karri; Siddharth Garg
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Logic synthesis is a challenging and widely-researched combinatorial optimization problem during integrated circuit (IC) design. It transforms a high-level description of hardware in a programming language like Verilog into an optimized digital circuit netlist, a network of interconnected Boolean logic gates, that implements the function. Spurred by the success of ML in solving combinatorial and graph problems in other domains, there is growing interest in the design of ML-guided logic synthesis tools. Yet, there are no standard datasets or prototypical learning tasks defined for this problem domain. Here, we describe OpenABC-D,a large-scale, labeled dataset produced by synthesizing open source designs with a leading open-source logic synthesis tool and illustrate its use in developing, evaluating and benchmarking ML-guided logic synthesis. OpenABC-D has intermediate and final outputs in the form of 870,000 And-Inverter-Graphs (AIGs) produced from 1500 synthesis runs plus labels such as the optimized node counts, and de-lay. We define a generic learning problem on this dataset and benchmark existing solutions for it. The codes related to dataset creation and benchmark models are available athttps://github.com/NYU-MLDA/OpenABC.git.
n
Data from: Assessing predictive performance of supervised machine learning...
data.niaid.nih.gov
datasetcatalog.nlm.nih.gov
+1more
zip
Updated May 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Evans Omondi (2023). Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model [Dataset]. http://doi.org/10.5061/dryad.wh70rxwrh
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.wh70rxwrh
Dataset updated
May 23, 2023
Dataset provided by
Strathmore University
Authors
Evans Omondi
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, the study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work including Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metrics values and analysis, it was discovered that eXtreme Gradient Boosting was the most optimal algorithm in both classification and regression, with a R2 score of 97.45% and an Accuracy value of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen. Methods Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets. In a web-based data-science environment, they can study datasets and construct models.
d
Data from: Training dataset for NABat Machine Learning V1.0
catalog.data.gov
data.usgs.gov
+1more
Updated Nov 26, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
Explore at:
Dataset updated
Nov 26, 2025
Dataset provided by
U.S. Geological Survey
Description
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm, however the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N =3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N =11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reach 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
m
Data from: SalmonScan: A Novel Image Dataset for Machine Learning and Deep...
data.mendeley.com
Updated Apr 2, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Md Shoaib Ahmed (2024). SalmonScan: A Novel Image Dataset for Machine Learning and Deep Learning Analysis in Fish Disease Detection in Aquaculture [Dataset]. http://doi.org/10.17632/x3fz2nfm4w.3
Explore at:
Unique identifier
https://doi.org/10.17632/x3fz2nfm4w.3
Dataset updated
Apr 2, 2024
Authors
Md Shoaib Ahmed
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The SalmonScan dataset is a collection of images of salmon fish, including healthy fish and infected fish. The dataset consists of two classes of images:

Fresh salmon 🐟 Infected Salmon 🐠

This dataset is ideal for various computer vision tasks in machine learning and deep learning applications. Whether you are a researcher, developer, or student, the SalmonScan dataset offers a rich and diverse data source to support your projects and experiments.

So, dive in and explore the fascinating world of salmon health and disease!

The SalmonScan dataset (raw) consists of 24 fresh fish and 91 infected fish. [Due to server cleaning in the past, some raw datasets have been deleted]

The SalmonScan dataset (augmented) consists of approximately 1,208 images of salmon fish, classified into two classes:

Fresh salmon (healthy fish with no visible signs of disease), 456 images

Infected Salmon containing disease, 752 images

Each class contains a representative and diverse collection of images, capturing a range of different perspectives, scales, and lighting conditions. The images have been carefully curated to ensure that they are of high quality and suitable for use in a variety of computer vision tasks.

Data Preprocessing

The input images were preprocessed to enhance their quality and suitability for further analysis. The following steps were taken:

Resizing 📏: All the images were resized to a uniform size of 600 pixels in width and 250 pixels in height to ensure compatibility with the learning algorithm. Image Augmentation 📸: To overcome the small amount of images, various image augmentation techniques were applied to the input images. These included: Horizontal Flip ↩️: The images were horizontally flipped to create additional samples. Vertical Flip ⬆️: The images were vertically flipped to create additional samples. Rotation 🔄: The images were rotated to create additional samples. Cropping 🪓: A portion of the image was randomly cropped to create additional samples. Gaussian Noise 🌌: Gaussian noise was added to the images to create additional samples. Shearing 🌆: The images were sheared to create additional samples. Contrast Adjustment (Gamma) ⚖️: The gamma correction was applied to the images to adjust their contrast. Contrast Adjustment (Sigmoid) ⚖️: The sigmoid function was applied to the images to adjust their contrast.

Usage

To use the salmon scan dataset in your ML and DL projects, follow these steps:

Clone or download the salmon scan dataset repository from GitHub.

Use standard libraries such as numpy or pandas to convert the images into arrays, which can be input into a machine learning or deep learning model.

Split the dataset into training, validation, and test sets as per your requirement.

Preprocess the data as needed, such as resizing and normalizing the images.

Train your ML/DL model using the preprocessed training data.

Evaluate the model on the test set and make predictions on new, unseen data.
p
A multimodal dental dataset facilitating machine learning research and...
physionet.org
oppositeofnorth.com
Updated Oct 11, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wenjing Liu; Yunyou Huang; Suqin Tang (2024). A multimodal dental dataset facilitating machine learning research and clinic services [Dataset]. http://doi.org/10.13026/h1tt-fc69
Explore at:
Unique identifier
https://doi.org/10.13026/h1tt-fc69
Dataset updated
Oct 11, 2024
Authors
Wenjing Liu; Yunyou Huang; Suqin Tang
License
https://github.com/MIT-LCP/license-and-dua/tree/master/draftshttps://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Description
Oral diseases affect nearly 3.5 billion people, with the majority residing in low- and middle-income countries. Due to limited healthcare resources, many individuals are unable to access proper oral healthcare services. Image-based machine learning technology is one of the most promising approaches to improving oral healthcare services and reducing patient costs. Openly accessible datasets play a crucial role in facilitating the development of machine learning techniques. However, existing dental datasets have limitations such as a scarcity of Cone Beam Computed Tomography (CBCT) data, lack of matched multi-modal data, and insufficient complexity and diversity of the data. This project addresses these challenges by providing a dataset that includes 329 CBCT images from 169 patients, multi-modal data with matching modalities, and images representing various oral health conditions.

Facebook

Twitter

Click to copy link

Link copied

Cite

Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning

Machine Learning Dataset

Explore at:

.json, .csv, .xlsxAvailable download formats

Dataset updated

Dec 23, 2024

Dataset authored and provided by

Bright Datahttps://brightdata.com/

License

https://brightdata.com/licensehttps://brightdata.com/license

Area covered

Worldwide

Description

Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.

Clear search

Close search

Google apps

Main menu

Machine Learning Dataset

A Dataset for Machine Learning Algorithm Development

SYNERGY - Open machine learning dataset on study selection in systematic...

Artificial Intelligence Training Dataset Report

AI & ML Popularity Index

Datasets

A dataset for machine learning research in the field of stress analyses of...

Airoboros LLMs Math Dataset

Airoboros LLMs Math Dataset

Mastering Complex Mathematical Operations in Machine Learning

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Research Ideas

Acknowledgements

License

Columns

Acknowledgements

Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects...

Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

MNIST

Android malware dataset for machine learning 2

Encrypted Traffic Feature Dataset for Machine Learning and Deep Learning...

Titanic Dataset - Machine Learning from Disaster

Overview

The data has been split into two groups:

Data Dictionary:

Variable Notes

Insurance Data for Machine Learning

OpenABC-D: A Large-Scale Dataset For Machine Learning Guided Integrated...

Data from: Assessing predictive performance of supervised machine learning...

Data from: Training dataset for NABat Machine Learning V1.0

Data from: SalmonScan: A Novel Image Dataset for Machine Learning and Deep...

A multimodal dental dataset facilitating machine learning research and...

Machine Learning Dataset