100+ datasets found
  1. A Dataset for Machine Learning Algorithm Development

    • catalog.data.gov
    • fisheries.noaa.gov
    Updated May 1, 2024
    Cite
    (Point of Contact, Custodian) (2024). A Dataset for Machine Learning Algorithm Development [Dataset]. https://catalog.data.gov/dataset/a-dataset-for-machine-learning-algorithm-development2
    Dataset provided by
    (Point of Contact, Custodian)
    Description

    This dataset consists of imagery, imagery footprints, associated ice seal detections and homography files associated with the KAMERA Test Flights conducted in 2019. This dataset was subset to include relevant data for detection algorithm development. This dataset is limited to data collected during flights 4, 5, 6 and 7 from our 2019 surveys.

  2. Machine Learning Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated Dec 23, 2024
    Cite
    Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.

  3. Machine Learning Basics for Beginners🤖🧠

    • kaggle.com
    zip
    Updated Jun 22, 2023
    Cite
    Bhanupratap Biswas (2023). Machine Learning Basics for Beginners🤖🧠 [Dataset]. https://www.kaggle.com/datasets/bhanupratapbiswas/machine-learning-basics-for-beginners
    Explore at:
    zip (492015 bytes)
    Authors
    Bhanupratap Biswas
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0 (http://www.opendatacommons.org/licenses/pddl/1.0/)
    License information was derived automatically

    Description

    Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:

    1. Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.

    2. Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.

    3. Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.

    4. Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).

    5. Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of actual positive examples that are correctly identified), and F1 score (the harmonic mean of precision and recall).

    6. Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.

    7. Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.

    8. Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.

    9. Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.

    10. Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.

    These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.
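    The metrics in point 5 can be computed with a few lines of plain Python; here is a minimal sketch using invented toy labels (1 = spam, 0 = not spam):

```python
def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels: 2 true positives, 1 false positive, 1 false negative,
# so precision = recall = F1 = 2/3 and accuracy = 4/6.
metrics = evaluate([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

    In practice, libraries such as scikit-learn provide these metrics ready-made (e.g., sklearn.metrics.f1_score).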

  4. Data from: MLOmics: Cancer Multi-Omics Database for Machine Learning

    • figshare.com
    bin
    Updated May 25, 2025
    + more versions
    Cite
    Rikuto Kotoge (2025). MLOmics: Cancer Multi-Omics Database for Machine Learning [Dataset]. http://doi.org/10.6084/m9.figshare.28729127.v2
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Rikuto Kotoge
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Framing the investigation of diverse cancers as a machine learning problem has recently shown significant potential in multi-omics analysis and cancer research. Empowering these successful machine learning models are high-quality training datasets with sufficient data volume and adequate preprocessing. However, while there exist several public data portals, including The Cancer Genome Atlas (TCGA) multi-omics initiative, and open databases such as LinkedOmics, these databases are not off-the-shelf for existing machine learning models. We propose MLOmics, an open cancer multi-omics database that aims to better serve the development and evaluation of bioinformatics and machine learning models. MLOmics contains 8,314 patient samples covering all 32 cancer types with four omics types, stratified features, and extensive baselines. Complementary support for downstream analysis and bio-knowledge linking is also included to support interdisciplinary analysis.

  5. Data for Machine Learning

    • kaggle.com
    zip
    Updated Nov 20, 2024
    Cite
    jiankang wang663 (2024). Data for Machine Learning [Dataset]. https://www.kaggle.com/datasets/jiankangwang663/data-for-machine-learning
    Explore at:
    zip (25942 bytes)
    Authors
    jiankang wang663
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by jiankang wang663

    Released under MIT

    Contents

    Data for machine learning course.

  6. SYNERGY - Open machine learning dataset on study selection in systematic reviews

    • dataverse.nl
    csv, json, txt, zip
    Updated Apr 24, 2023
    Cite
    Jonathan De Bruin; Yongchao Ma; Gerbrich Ferdinands; Jelle Teijema; Rens Van de Schoot (2023). SYNERGY - Open machine learning dataset on study selection in systematic reviews [Dataset]. http://doi.org/10.34894/HE6NAQ
    Explore at:
    csv, json, txt, zip (many files of varying sizes)
    Dataset provided by
    DataverseNL
    Authors
    Jonathan De Bruin; Yongchao Ma; Gerbrich Ferdinands; Jelle Teijema; Rens Van de Schoot
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    SYNERGY is a free and open dataset on study selection in systematic reviews, comprising 169,288 academic works from 26 systematic reviews. Only 2,834 (1.67%) of the academic works in the binary classified dataset are included in the systematic reviews. This makes SYNERGY a unique dataset for the development of information retrieval algorithms, especially for sparse labels. Due to the many variables available per record (i.e., titles, abstracts, authors, references, topics), this dataset is useful for researchers in NLP, machine learning, network analysis, and more. In total, the dataset contains 82,668,134 trainable data points. The easiest way to get the SYNERGY dataset is via the synergy-dataset Python package. See https://github.com/asreview/synergy-dataset for all information.

  7. Data from: Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Nov 26, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
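    The selection-and-split procedure described above can be sketched in a few lines of Python. This is an illustration only: the cap, split fractions, record layout, and species codes are assumptions, not the USGS implementation.

```python
import random

def select_and_split(files, cap=1250, fracs=(0.7, 0.15, 0.15), seed=0):
    """files: list of (file_id, species) records (illustrative layout).
    Randomly draw files until each species reaches the cap or its pool is
    exhausted, then split into training, validation, and test (holdout)."""
    rng = random.Random(seed)
    pool = list(files)
    rng.shuffle(pool)
    counts, selected = {}, []
    for file_id, species in pool:
        if counts.get(species, 0) < cap:
            counts[species] = counts.get(species, 0) + 1
            selected.append((file_id, species))
    rng.shuffle(selected)
    n = len(selected)
    n_train = int(fracs[0] * n)
    n_val = int(fracs[1] * n)
    train = selected[:n_train]
    val = selected[n_train:n_train + n_val]
    test = selected[n_train + n_val:]  # holdout, excluded from the release
    return train, val, test

# Toy pool: 30 recordings of one species, 5 of another, with a cap of 10
pool = [(f"rec{i}", "SPP1") for i in range(30)] + [(f"rec{i}", "SPP2") for i in range(30, 35)]
train, val, test = select_and_split(pool, cap=10)
```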

  8. Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO (2023). NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python [Dataset]. http://doi.org/10.6084/m9.figshare.21967265.v1
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts at filtering those projects to curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file that contains the list of the project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.

    GitHub page: https://github.com/soarsmu/NICHE

  9. Artificial Intelligence Training Dataset Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 3, 2025
    + more versions
    Cite
    Data Insights Market (2025). Artificial Intelligence Training Dataset Report [Dataset]. https://www.datainsightsmarket.com/reports/artificial-intelligence-training-dataset-1958994
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Artificial Intelligence (AI) Training Dataset market is experiencing robust growth, driven by the increasing adoption of AI across diverse sectors. The market's expansion is fueled by the burgeoning need for high-quality data to train sophisticated AI algorithms capable of powering applications like smart campuses, autonomous vehicles, and personalized healthcare solutions. The demand for diverse dataset types, including image classification, voice recognition, natural language processing, and object detection datasets, is a key factor contributing to market growth. While the exact market size in 2025 is not reported, a conservative estimate of a $10 billion market, based on the growth trend and reported market sizes of related industries, together with a projected CAGR (Compound Annual Growth Rate) of 25%, suggests the market is poised for significant expansion in the coming years. Key players in this space are leveraging technological advancements and strategic partnerships to enhance data quality and expand their service offerings. Furthermore, the increasing availability of cloud-based data annotation and processing tools is further streamlining operations and making AI training datasets more accessible to businesses of all sizes. Growth is expected to be particularly strong in regions with burgeoning technological advancements and substantial digital infrastructure, such as North America and Asia Pacific. However, challenges such as data privacy concerns, the high cost of data annotation, and the scarcity of skilled professionals capable of handling complex datasets remain obstacles to broader market penetration. The ongoing evolution of AI technologies and the expanding applications of AI across multiple sectors will continue to shape the demand for AI training datasets, pushing this market toward higher growth trajectories in the coming years.
The diversity of applications—from smart homes and medical diagnoses to advanced robotics and autonomous driving—creates significant opportunities for companies specializing in this market. Maintaining data quality, security, and ethical considerations will be crucial for future market leadership.

  10. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code

    • zenodo.org
    csv
    Updated Sep 15, 2023
    Cite
    Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

    The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is stored in .csv format.

    Each competition has a text description and metadata reflecting the competition and dataset characteristics as well as the evaluation metrics (competitions.csv). The corresponding datasets can be loaded using the Kaggle API and the listed data sources.

    The code blocks themselves and their metadata are collected into data frames according to the publishing year of the initial kernels. The current version of the corpus includes two code block files: snippets from kernels up to 2020 (code_blocks_upto_20.csv) and those from 2021 (code_blocks_21.csv), with corresponding metadata. The corpus consists of 2,743,615 ML code blocks collected from 107,524 Jupyter notebooks.

    Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).

    As marked up code blocks data contains the numeric id of the code block semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).

    The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
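    Since each table is a .csv file, working with the corpus amounts to ordinary CSV parsing. The sketch below illustrates the idea on a hypothetical miniature of competitions.csv; the column names and values are invented for illustration and may differ from the real Code4ML schema.

```python
import csv
import io

# Hypothetical miniature of competitions.csv; the real column names and
# values in Code4ML may differ from this illustrative layout.
sample = """comp_name,description,metric
titanic,Predict survival on the Titanic,accuracy
house-prices,Predict house sale prices,rmse
"""

competitions = list(csv.DictReader(io.StringIO(sample)))
# Map each competition to its evaluation metric
metrics_by_comp = {row["comp_name"]: row["metric"] for row in competitions}
```

    Reading the real files works the same way, with open("competitions.csv") in place of the in-memory sample.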

  11. machine-learning dataset

    • figshare.com
    xlsx
    Updated Sep 10, 2023
    Cite
    zhang xin (2023). machine-learning dataset [Dataset]. http://doi.org/10.6084/m9.figshare.24115383.v1
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    zhang xin
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The dataset is used to train a machine learning model for the study of the passivation effect of small molecules.

  12. Data from: Project Machine Learning Dataset

    • universe.roboflow.com
    zip
    Updated Jun 6, 2024
    + more versions
    Cite
    soda (2024). Project Machine Learning Dataset [Dataset]. https://universe.roboflow.com/soda-fj5ov/project-machine-learning-8sjsi
    Dataset authored and provided by
    soda
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Variables measured
    Deteksi Rempah Rempah Bounding Boxes
    Description

    Project Machine Learning

    ## Overview
    
    Project Machine Learning is a dataset for object detection tasks - it contains Deteksi Rempah Rempah (spice detection) annotations for 1,270 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  13. Data from: Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model

    • data.niaid.nih.gov
    • datasetcatalog.nlm.nih.gov
    • +1 more
    zip
    Updated May 23, 2023
    Cite
    Evans Omondi (2023). Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model [Dataset]. http://doi.org/10.5061/dryad.wh70rxwrh
    Dataset provided by
    Strathmore University
    Authors
    Evans Omondi
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, the study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work, including Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metric values and analysis, it was discovered that eXtreme Gradient Boosting was the most optimal algorithm in both classification and regression, with an R2 score of 97.45% and an accuracy of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen.

    Methods

    Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets. In a web-based data-science environment, they can study datasets and construct models.
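    The R2 score used to rank the regressors measures the fraction of variance in price explained by the model. A minimal pure-Python sketch (the price values below are invented for illustration, not drawn from the dataset):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

# Toy diamond prices (invented numbers)
actual = [500.0, 1500.0, 3000.0, 4500.0]
predicted = [550.0, 1400.0, 3100.0, 4400.0]
score = r2_score(actual, predicted)  # close to 1.0 for good predictions
```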

  14. UCI and OpenML Data Sets for Ordinal Quantification

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Jul 25, 2023
    + more versions
    Cite
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
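    A common way to realize the APP's "all label distributions equally likely" requirement is to sample uniformly from the probability simplex. The sketch below shows one standard method, sorted uniform cut points; this is an illustration under that assumption, not necessarily the authors' implementation.

```python
import random

def sample_app_prevalence(n_classes, rng=random):
    """Draw one class-prevalence vector uniformly from the probability
    simplex, using sorted uniform cut points."""
    cuts = sorted(rng.random() for _ in range(n_classes - 1))
    points = [0.0] + cuts + [1.0]
    return [points[i + 1] - points[i] for i in range(n_classes)]

prevalence = sample_app_prevalence(4)  # one sample's label distribution
```

    APP-OQ would then keep only the smoothest 20% of such samples, as described above.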

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  15. G

    In-Database Machine Learning Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 4, 2025
    Cite
    Growth Market Reports (2025). In-Database Machine Learning Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/in-database-machine-learning-market
    Explore at:
    Available download formats: csv, pdf, pptx
    Dataset updated
    Aug 4, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    In-Database Machine Learning Market Outlook

    According to our latest research, the global in-database machine learning market size in 2024 stands at USD 2.74 billion, reflecting the sector’s rapid adoption across diverse industries. The market is expected to grow at a robust CAGR of 28.6% from 2025 to 2033, reaching a projected value of USD 24.19 billion by the end of the forecast period. This exceptional growth is primarily driven by the increasing demand for advanced analytics, real-time data processing, and the seamless integration of machine learning capabilities directly within database environments, which are essential for accelerating business insights and operational efficiency.

    The primary growth factor propelling the in-database machine learning market is the exponential surge in data volumes generated by enterprises worldwide. As organizations transition to digital-first operations, the need to analyze vast datasets in real time has become paramount. Traditional machine learning workflows, which require data extraction and movement to external environments, are increasingly seen as inefficient and prone to latency and security issues. In-database machine learning eliminates these bottlenecks by enabling algorithms to run directly within the database, thus reducing data movement, minimizing latency, and ensuring higher data security. This approach not only streamlines the analytics pipeline but also empowers businesses to derive actionable insights faster, supporting critical functions such as fraud detection, predictive maintenance, and customer personalization.
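    The "run the model where the data lives" idea above can be illustrated with a minimal sketch: a hypothetical, pre-trained fraud scorer registered as a SQL function in SQLite, so scoring happens inside the database engine and only results leave it. The table, coefficients, and `fraud_score` function are invented for illustration; production in-database ML platforms expose far richer primitives than a single UDF.

    ```python
    import math
    import sqlite3

    # Minimal sketch of in-database scoring: the model is evaluated inside the
    # SQL engine instead of exporting rows to an external ML process. The table,
    # coefficients, and scoring function are illustrative assumptions.

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (id INTEGER, amount REAL, hour INTEGER)")
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?)",
        [(1, 25.0, 14), (2, 4300.0, 3), (3, 80.0, 11)],
    )

    # Illustrative logistic-regression coefficients, as if fitted offline.
    B0, B_AMOUNT, B_HOUR = -4.0, 0.001, -0.05

    def fraud_score(amount, hour):
        """Sigmoid of a linear score: a toy pre-trained fraud model."""
        z = B0 + B_AMOUNT * amount + B_HOUR * hour
        return 1.0 / (1.0 + math.exp(-z))

    conn.create_function("fraud_score", 2, fraud_score)

    # Scoring runs inside the SQL engine; only the ranked results come back.
    rows = conn.execute(
        "SELECT id, fraud_score(amount, hour) FROM transactions ORDER BY 2 DESC"
    ).fetchall()
    ```
    
    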

    Another significant factor fueling market expansion is the growing adoption of cloud-based data platforms and the proliferation of hybrid IT infrastructures. Enterprises are leveraging cloud-native databases and data warehouses to centralize and scale their analytics capabilities. In-database machine learning solutions are designed to seamlessly integrate with these modern architectures, allowing organizations to harness the power of machine learning without the need for extensive data migration or IT overhead. This integration facilitates agile development, lowers total cost of ownership, and enables organizations to respond swiftly to market changes. Furthermore, the rise of open-source machine learning frameworks and APIs has democratized access to advanced analytics, making it easier for businesses of all sizes to implement and benefit from in-database ML capabilities.

    A third pivotal growth driver is the increasing emphasis on regulatory compliance, data privacy, and security in highly regulated industries such as BFSI and healthcare. In-database machine learning offers a compelling solution by keeping sensitive data within secure database environments, thereby reducing the risk of data breaches and ensuring compliance with stringent data protection regulations such as GDPR and HIPAA. This capability is particularly valuable for organizations operating in regions with complex regulatory landscapes, where data residency and sovereignty are critical concerns. As a result, the adoption of in-database ML is accelerating among enterprises that prioritize security, governance, and auditability in their analytics workflows.

    From a regional perspective, North America continues to dominate the in-database machine learning market, accounting for the largest revenue share in 2024, followed closely by Europe and Asia Pacific. The presence of leading technology vendors, early adoption of advanced analytics, and a mature digital infrastructure contribute to North America’s leadership. However, rapid economic development, digitization initiatives, and expanding IT ecosystems in Asia Pacific are positioning the region as a significant growth engine for the forecast period. Meanwhile, Europe’s focus on data privacy and innovation is driving substantial investments in secure and compliant in-database ML solutions, further fueling market growth across the continent.

    Component Analysis

    The in-database machine learning mark

  16. d

    Prediction data from: Machine learning predicts which rivers, streams, and...

    • datadryad.org
    • dataone.org
    • +1more
    zip
    Updated Dec 10, 2023
    Cite
    Simon Greenhill; Hannah Druckenmiller; Sherrie Wang; David Keiser; Manuela Girotto; Jason Moore; Nobuhiro Yamaguchi; Alberto Todeschini; Joseph Shapiro (2023). Prediction data from: Machine learning predicts which rivers, streams, and wetlands the Clean Water Act regulates [Dataset]. http://doi.org/10.5061/dryad.z34tmpgm7
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Dryad
    Authors
    Simon Greenhill; Hannah Druckenmiller; Sherrie Wang; David Keiser; Manuela Girotto; Jason Moore; Nobuhiro Yamaguchi; Alberto Todeschini; Joseph Shapiro
    Time period covered
    Sep 27, 2023
    Description

    This dataset contains model outputs that were analyzed to produce the main results of the paper.

  17. i

    Big Data Machine Learning Benchmark on Spark

    • ieee-dataport.org
    Updated Jun 6, 2019
    Cite
    Jairson Rodrigues (2019). Big Data Machine Learning Benchmark on Spark [Dataset]. https://ieee-dataport.org/open-access/big-data-machine-learning-benchmark-spark
    Explore at:
    Dataset updated
    Jun 6, 2019
    Authors
    Jairson Rodrigues
    Description

    net traffic

  18. d

    Data from: USGS Contributions to the Nevada Geothermal Machine Learning...

    • catalog.data.gov
    • datasets.ai
    Updated Nov 19, 2025
    Cite
    U.S. Geological Survey (2025). USGS Contributions to the Nevada Geothermal Machine Learning Project (DE-FOA-0001956): Slip and Dilation Tendency Data [Dataset]. https://catalog.data.gov/dataset/usgs-contributions-to-the-nevada-geothermal-machine-learning-project-de-foa-0001956-slip-a
    Explore at:
    Dataset updated
    Nov 19, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This package contains data for a portion of northern Nevada, the extent of the ‘Nevada Machine Learning Project’ (DE-EE0008762). Slip tendency (TS) and dilation tendency (TD) were calculated for all the faults in the Nevada ML study area. TS is the ratio between the shear components of the stress tensor and the normal components of the stress tensor acting on a fault plane. TD is the ratio of the components of the stress tensor that are normal to a fault plane. Faults with higher TD are relatively more likely to dilate and host open, conductive fractures. Faults with higher TS are relatively more likely to slip, and these fractures may be propped open and conductive. These values of TS and TD were used to update a map surface from the Nevada Geothermal Machine Learning Project (DE-FOA-0001956) that had used less reliable estimates of TS and TD. The new map surface was generated using the same procedure as the old surface, just with the new TS and TD data values.
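    Assuming the standard Morris et al. definitions of slip and dilation tendency (an assumption on our part; this data release does not ship code), TS and TD for a single fault plane can be sketched from a stress tensor and the plane's unit normal. The principal stresses and fault orientation below are illustrative values, not data from the package.

    ```python
    import math

    # Sketch of slip tendency (TS) and dilation tendency (TD) for one fault
    # plane, following the standard Morris et al. definitions (an assumption).

    def traction(sigma, n):
        """Traction vector t = sigma . n on a plane with unit normal n."""
        return [sum(sigma[i][j] * n[j] for j in range(3)) for i in range(3)]

    def tendencies(sigma, n, s1, s3):
        t = traction(sigma, n)
        sn = sum(t[i] * n[i] for i in range(3))                      # normal stress
        tau = math.sqrt(max(sum(c * c for c in t) - sn * sn, 0.0))   # shear stress
        ts = tau / sn                # slip tendency: shear over normal stress
        td = (s1 - sn) / (s1 - s3)   # dilation tendency
        return ts, td

    # Illustrative principal stresses (s1 > s2 > s3) along the coordinate axes.
    s1, s2, s3 = 100.0, 60.0, 40.0
    sigma = [[s1, 0.0, 0.0], [0.0, s2, 0.0], [0.0, 0.0, s3]]

    # Fault plane whose normal bisects the s1 and s3 axes.
    n = [math.cos(math.radians(45.0)), 0.0, math.sin(math.radians(45.0))]
    ts, td = tendencies(sigma, n, s1, s3)   # ts ≈ 0.43, td = 0.5
    ```
    
    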

  19. d

    Data from: Brook trout imagery data for individual recognition with deep...

    • catalog.data.gov
    • data.usgs.gov
    Updated Oct 29, 2025
    Cite
    U.S. Geological Survey (2025). Brook trout imagery data for individual recognition with deep learning [Dataset]. https://catalog.data.gov/dataset/brook-trout-imagery-data-for-individual-recognition-with-deep-learning
    Explore at:
    Dataset updated
    Oct 29, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    This Data Release provides imagery data for the development of deep-learning models to recognize individual brook trout (n=435). Images were collected at the Paint Bank State Fish Hatchery (Paint Bank, VA) on August 9, 2021 using a GoPro Hero 9 camera mounted approximately 50 cm above a fish board. The Paint Bank State Fish Hatchery is operated by the Virginia Department of Wildlife Resources.

  20. m

    A dataset for machine learning research in the field of stress analyses of...

    • data.mendeley.com
    • narcis.nl
    Updated Jul 25, 2020
    Cite
    Jaroslav Matej (2020). A dataset for machine learning research in the field of stress analyses of mechanical structures [Dataset]. http://doi.org/10.17632/wzbzznk8z3.2
    Explore at:
    Dataset updated
    Jul 25, 2020
    Authors
    Jaroslav Matej
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is prepared and intended as a data source for the development of a stress analysis method based on machine learning. It consists of finite element stress analyses of randomly generated mechanical structures. The dataset contains more than 270,794 pairs of stress analysis images (von Mises stress) of randomly generated 2D structures with predefined thickness and material properties. All the structures are fixed at their bottom edges and loaded with gravity force only. See the PREVIEW directory for some examples. The zip file contains all the files in the dataset.
