100+ datasets found
  1. Machine learning algorithm validation with a limited sample size

    • plos.figshare.com
    text/x-python
    Updated May 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
    Explore at:
    text/x-pythonAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high dimensional datasets, which commonly have a small number of samples because of the intrinsic high cost of data collection involving human participants. High dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work, however it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection if performed on pooled training and testing data is contributing to bias considerably more than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.

  2. Machine Learning Basics for Beginners🤖🧠

    • kaggle.com
    zip
    Updated Jun 22, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bhanupratap Biswas (2023). Machine Learning Basics for Beginners🤖🧠 [Dataset]. https://www.kaggle.com/datasets/bhanupratapbiswas/machine-learning-basics-for-beginners
    Explore at:
    zip(492015 bytes)Available download formats
    Dataset updated
    Jun 22, 2023
    Authors
    Bhanupratap Biswas
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    Sure! I'd be happy to provide you with an introduction to machine learning basics for beginners. Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:

    1. Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.

    2. Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.

    3. Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.

    4. Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).

    5. Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of true positives predicted correctly), and F1 score (a combination of precision and recall).

    6. Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.

    7. Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.

    8. Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.

    9. Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.

    10. Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.

    These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.

  3. Machine Learning Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated Dec 23, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
    Explore at:
    .json, .csv, .xlsxAvailable download formats
    Dataset updated
    Dec 23, 2024
    Dataset authored and provided by
    Bright Datahttps://brightdata.com/
    License

    https://brightdata.com/licensehttps://brightdata.com/license

    Area covered
    Worldwide
    Description

    Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.

  4. d

    A Dataset for Machine Learning Algorithm Development

    • catalog.data.gov
    • fisheries.noaa.gov
    Updated May 1, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (Point of Contact, Custodian) (2024). A Dataset for Machine Learning Algorithm Development [Dataset]. https://catalog.data.gov/dataset/a-dataset-for-machine-learning-algorithm-development2
    Explore at:
    Dataset updated
    May 1, 2024
    Dataset provided by
    (Point of Contact, Custodian)
    Description

    This dataset consists of imagery, imagery footprints, associated ice seal detections and homography files associated with the KAMERA Test Flights conducted in 2019. This dataset was subset to include relevant data for detection algorithm development. This dataset is limited to data collected during flights 4, 5, 6 and 7 from our 2019 surveys.

  5. Data for Machine Learning

    • kaggle.com
    zip
    Updated Nov 20, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    jiankang wang663 (2024). Data for Machine Learning [Dataset]. https://www.kaggle.com/datasets/jiankangwang663/data-for-machine-learning
    Explore at:
    zip(25942 bytes)Available download formats
    Dataset updated
    Nov 20, 2024
    Authors
    jiankang wang663
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by jiankang wang663

    Released under MIT

    Contents

    Data for machine learning course.

  6. Scraped Data on AI, ML, DS & Big Data Jobs

    • kaggle.com
    zip
    Updated Jun 18, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Joy Shil (2023). Scraped Data on AI, ML, DS & Big Data Jobs [Dataset]. https://www.kaggle.com/datasets/joyshil0599/data-science-jobs-comprehensive-dataset/data
    Explore at:
    zip(132015 bytes)Available download formats
    Dataset updated
    Jun 18, 2023
    Authors
    Joy Shil
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Scraped Data on AI, ML, DS & Big Data Jobs is a comprehensive dataset that includes valuable information about job opportunities in the fields of Artificial Intelligence (AI), Machine Learning (ML), Data Science (DS), and Big Data. The dataset covers various aspects, including company names, job titles, locations, job types (full-time, part-time, remote), experience levels, salary ranges, job requirements, and available facilities.

    This dataset offers a wealth of insights for job seekers, researchers, and organizations interested in the rapidly evolving fields of AI, ML, DS, and Big Data. By analyzing the data, users can gain a better understanding of the job market trends, geographical distribution of opportunities, popular job titles, required skills and qualifications, salary expectations, and the types of facilities provided by companies in these domains.

    Whether you are exploring career prospects, conducting market research, or building predictive models, this dataset serves as a valuable resource to extract meaningful insights and make informed decisions in the exciting world of AI, ML, DS, and Big Data jobs.

  7. d

    Data from: USGS Contributions to the Nevada Geothermal Machine Learning...

    • catalog.data.gov
    • datasets.ai
    Updated Nov 19, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Geological Survey (2025). USGS Contributions to the Nevada Geothermal Machine Learning Project (DE-FOA-0001956): Slip and Dilation Tendency Data [Dataset]. https://catalog.data.gov/dataset/usgs-contributions-to-the-nevada-geothermal-machine-learning-project-de-foa-0001956-slip-a
    Explore at:
    Dataset updated
    Nov 19, 2025
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Description

    This package contains data in a portion of northern Nevada, the extent of the ‘Nevada Machine Learning Project’ (DE-EE0008762). Slip tendency (TS) and dilation tendency (TD) were calculated for the all the faults in the Nevada ML study area. TS is the ratio between the shear components of the stress tensor and the normal components of the stress tensor acting on a fault plane. TD is the ratio of all the components of the stress tensor that are normal to a fault plane. Faults with higher TD are relatively more likely to dilate and host open, conductive fractures. Faults with higher TS are relatively more likely to slip, and these fractures may be propped open and conductive. These values of TS and TD were used to update a map surface from the Nevada Geothermal Machine Learning Project (DE-FOA-0001956) that used less reliable estimates for TS and TD. The new map surface was generated using the same procedure as the old surface, just with the new TS and TD data values.

  8. Data sources used by companies for training AI models South Korea 2024

    • statista.com
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista, Data sources used by companies for training AI models South Korea 2024 [Dataset]. https://www.statista.com/statistics/1452822/south-korea-data-sources-for-training-artificial-intelligence-models/
    Explore at:
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    Sep 2024 - Nov 2024
    Area covered
    South Korea
    Description

    As of 2024, customer data was the leading source of information used to train artificial intelligence (AI) models in South Korea, with nearly ** percent of surveyed companies answering that way. About ** percent responded to use public sector support initiatives.

  9. d

    80K+ Construction Site Images | AI Training Data | Machine Learning (ML)...

    • datarade.ai
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Data Seeds, 80K+ Construction Site Images | AI Training Data | Machine Learning (ML) data | Object & Scene Detection | Global Coverage [Dataset]. https://datarade.ai/data-products/50k-construction-site-images-ai-training-data-machine-le-data-seeds
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
    Dataset authored and provided by
    Data Seeds
    Area covered
    Russian Federation, Senegal, Tunisia, Guatemala, Swaziland, Grenada, United Arab Emirates, Peru, Venezuela (Bolivarian Republic of), Kenya
    Description

    This dataset features over 80,000 high-quality images of construction sites sourced from photographers worldwide. Built to support AI and machine learning applications, it delivers richly annotated and visually diverse imagery capturing real-world construction environments, machinery, and processes.

    Key Features: 1. Comprehensive Metadata: the dataset includes full EXIF data such as aperture, ISO, shutter speed, and focal length. Each image is annotated with construction phase, equipment types, safety indicators, and human activity context—making it ideal for object detection, site monitoring, and workflow analysis. Popularity metrics based on performance on our proprietary platform are also included.

    1. Unique Sourcing Capabilities: images are collected through a proprietary gamified platform, with competitions focused on industrial, construction, and labor themes. Custom datasets can be generated within 72 hours to target specific scenarios, such as building types, stages (excavation, framing, finishing), regions, or safety compliance visuals.

    2. Global Diversity: sourced from contributors in over 100 countries, the dataset reflects a wide range of construction practices, materials, climates, and regulatory environments. It includes residential, commercial, industrial, and infrastructure projects from both urban and rural areas.

    3. High-Quality Imagery: includes a mix of wide-angle site overviews, close-ups of tools and equipment, drone shots, and candid human activity. Resolution varies from standard to ultra-high-definition, supporting both macro and contextual analysis.

    4. Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. These scores provide insight into visual clarity, engagement value, and human interest—useful for safety-focused or user-facing AI models.

    5. AI-Ready Design: this dataset is structured for training models in real-time object detection (e.g., helmets, machinery), construction progress tracking, material identification, and safety compliance. It’s compatible with standard ML frameworks used in construction tech.

    6. Licensing & Compliance: fully compliant with privacy, labor, and workplace imagery regulations. Licensing is transparent and ready for commercial or research deployment.

    Use Cases: 1. Training AI for safety compliance monitoring and PPE detection. 2. Powering progress tracking and material usage analysis tools. 3. Supporting site mapping, autonomous machinery, and smart construction platforms. 4. Enhancing augmented reality overlays and digital twin models for construction planning.

    This dataset provides a comprehensive, real-world foundation for AI innovation in construction technology, safety, and operational efficiency. Custom datasets are available on request. Contact us to learn more!

  10. D

    Machine Learning Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dataintelo (2025). Machine Learning Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/machine-learning-market
    Explore at:
    csv, pptx, pdfAvailable download formats
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Machine Learning Market Outlook



    The global machine learning market is projected to witness a remarkable growth trajectory, with the market size estimated to reach USD 21.17 billion in 2023 and anticipated to expand to USD 209.91 billion by 2032, growing at a compound annual growth rate (CAGR) of 29.2% over the forecast period. This extraordinary growth is primarily propelled by the escalating demand for artificial intelligence-driven solutions across various industries. As businesses seek to leverage machine learning for improving operational efficiency, enhancing customer experience, and driving innovation, the market is poised to expand rapidly. Key factors contributing to this growth include advancements in data generation, increasing computational power, and the proliferation of big data analytics.



    A pivotal growth factor for the machine learning market is the ongoing digital transformation across industries. Enterprises globally are increasingly adopting machine learning technologies to optimize their operations, streamline processes, and make data-driven decisions. The healthcare sector, for example, leverages machine learning for predictive analytics to improve patient outcomes, while the finance sector uses machine learning algorithms for fraud detection and risk assessment. The retail industry is also utilizing machine learning for personalized customer experiences and inventory management. The ability of machine learning to analyze vast amounts of data in real-time and provide actionable insights is fueling its adoption across various applications, thereby driving market growth.



    Another significant growth driver is the increasing integration of machine learning with the Internet of Things (IoT). The convergence of these technologies enables the creation of smarter, more efficient systems that enhance operational performance and productivity. In manufacturing, for instance, IoT devices equipped with machine learning capabilities can predict equipment failures and optimize maintenance schedules, leading to reduced downtime and costs. Similarly, in the automotive industry, machine learning algorithms are employed in autonomous vehicles to process and analyze sensor data, improving navigation and safety. The synergistic relationship between machine learning and IoT is expected to further propel market expansion during the forecast period.



    Moreover, the rising investments in AI research and development by both public and private sectors are accelerating the advancement and adoption of machine learning technologies. Governments worldwide are recognizing the potential of AI and machine learning to transform industries, leading to increased funding for research initiatives and innovation centers. Companies are also investing heavily in developing cutting-edge machine learning solutions to maintain a competitive edge. This robust investment landscape is fostering an environment conducive to technological breakthroughs, thereby contributing to the growth of the machine learning market.



    Supervised Learning, a subset of machine learning, plays a crucial role in the advancement of AI-driven solutions. It involves training algorithms on a labeled dataset, allowing the model to learn and make predictions or decisions based on new, unseen data. This approach is particularly beneficial in applications where the desired output is known, such as in classification or regression tasks. For instance, in the healthcare sector, supervised learning algorithms are employed to analyze patient data and predict health outcomes, thereby enhancing diagnostic accuracy and treatment efficacy. Similarly, in finance, these algorithms are used for credit scoring and fraud detection, providing financial institutions with reliable tools for risk assessment. As the demand for precise and efficient AI applications grows, the significance of supervised learning in driving innovation and operational excellence across industries becomes increasingly evident.



    From a regional perspective, North America holds a dominant position in the machine learning market due to the early adoption of advanced technologies and the presence of major technology companies. The region's strong focus on R&D and innovation, coupled with a well-established IT infrastructure, further supports market growth. In addition, Asia Pacific is emerging as a lucrative market for machine learning, driven by rapid industrialization, increasing digitalization, and government initiatives promoting AI adoption. The region is witnessing significant investments in AI technologies, particu

  11. d

    Data from: Delaware River Basin Stream Salinity Machine Learning Models and...

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 26, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Geological Survey (2025). Delaware River Basin Stream Salinity Machine Learning Models and Data [Dataset]. https://catalog.data.gov/dataset/delaware-river-basin-stream-salinity-machine-learning-models-and-data
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Description

    This model archive contains the input data, model code, and model outputs for machine learning models that predict daily non-tidal stream salinity (specific conductance) for a network of 459 modeled stream segments across the Delaware River Basin (DRB) from 1984-09-30 to 2021-12-31. There are a total of twelve models from combinations of two machine learning models (Random Forest and Recurrent Graph Convolution Neural Networks), two training/testing partitions (spatial and temporal), and three input attribute sets (dynamic attributes, dynamic and static attributes, and dynamic attributes and a minimum set of static attributes). In addition to the inputs and outputs for non-tidal predictions provided on the landing page, we also provide example predictions for models trained with additional tidal stream segments within the model archive (TidalExample folder), but we do not recommend our models for this use case. Model outputs contained within the model archive include performance metrics, plots of spatial and temporal errors, and Shapley (SHAP) explainable artificial intelligence plots for the best models. The results of these models provide insights into DRB stream segments with elevated salinity, and processes that drive stream salinization across the DRB, which may be used to inform salinity management. This data compilation was funded by the USGS.

  12. Learning Path Index Dataset

    • kaggle.com
    zip
    Updated Nov 6, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mani Sarkar (2024). Learning Path Index Dataset [Dataset]. https://www.kaggle.com/datasets/neomatrix369/learning-path-index-dataset/code
    Explore at:
    zip(151846 bytes)Available download formats
    Dataset updated
    Nov 6, 2024
    Authors
    Mani Sarkar
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Description

    The Learning Path Index Dataset is a comprehensive collection of byte-sized courses and learning materials tailored for individuals eager to delve into the fields of Data Science, Machine Learning, and Artificial Intelligence (AI), making it an indispensable reference for students, professionals, and educators in the Data Science and AI communities.

    This Kaggle Dataset along with the KaggleX Learning Path Index GitHub Repo were created by the mentors and mentees of Cohort 3 KaggleX BIPOC Mentorship Program (between August 2023 and November 2023, also see this). See Credits section at the bottom of the long description.

    Inspiration

    This dataset was created out of a commitment to facilitate learning and growth within the Data Science, Machine Learning, and AI communities. It started off as an idea at the end of Cohort 2 of the KaggleX BIPOC Mentorship Program brainstorming and feedback session. It was one of the ideas to create byte-sized learning material to help our KaggleX mentees learn things faster. It aspires to simplify the process of finding, evaluating, and selecting the most fitting educational resources.

    Context

    This dataset was meticulously curated to assist learners in navigating the vast landscape of Data Science, Machine Learning, and AI education. It serves as a compass for those aiming to develop their skills and expertise in these rapidly evolving fields.

    The mentors and mentees communicated via Discord, Trello, Google Hangout, etc... to put together these artifacts and made them public for everyone to use and contribute back.

    Sources

    The dataset compiles data from a curated selection of reputable sources including leading educational platforms such as Google Developer, Google Cloud Skill Boost, IBM, Fast AI, etc. By drawing from these trusted sources, we ensure that the data is both accurate and pertinent. The raw data and other artifacts as a result of this exercise can be found on the GitHub Repo i.e. KaggleX Learning Path Index GitHub Repo.

    Content

    The dataset encompasses the following attributes:

    • Course / Learning Material: The title of the Data Science, Machine Learning, or AI course or learning material.
    • Source: The provider or institution offering the course.
    • Course Level: The proficiency level, ranging from Beginner to Advanced.
    • Type (Free or Paid): Indicates whether the course is available for free or requires payment.
    • Module: Specific module or section within the course.
    • Duration: The estimated time required to complete the module or course.
    • Module / Sub-module Difficulty Level: The complexity level of the module or sub-module.
    • Keywords / Tags / Skills / Interests / Categories: Relevant keywords, tags, or categories associated with the course with a focus on Data Science, Machine Learning, and AI.
    • Links: Hyperlinks to access the course or learning material directly.

    How to contribute to this initiative?

    • You can also join us by taking part in the next KaggleX BIPOC Mentorship program (also see this)
    • Keep your eyes open on the Kaggle Discussions page and other KaggleX social media channels. Or find us on the Kaggle Discord channel to learn more about the next steps
    • Create notebooks from this data
    • Create supplementary or complementary data for or from this dataset
    • Submit corrections/enhancements or anything else to help improve this dataset so it has a wider use and purpose

    License

    The Learning Path Index Dataset is openly shared under a permissive license, allowing users to utilize the data for educational, analytical, and research purposes within the Data Science, Machine Learning, and AI domains. Feel free to fork the dataset and make it your own, we would be delighted if you contributed back to the dataset and/or our KaggleX Learning Path Index GitHub Repo as well.

    Important Links

    Credits

    Credits for all the work done to create this Kaggle Dataset and the KaggleX [Learnin...

  13. Network Digital Twin-Generated Dataset for Machine Learning-Based Detection...

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Nov 13, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Amit Karamchandani Batra; Amit Karamchandani Batra; Javier Nuñez Fuente; Luis de la Cal García; Luis de la Cal García; Yenny Moreno Meneses; Alberto Mozo Velasco; Alberto Mozo Velasco; Antonio Pastor Perales; Antonio Pastor Perales; Diego R. López; Diego R. López; Javier Nuñez Fuente; Yenny Moreno Meneses (2024). Network Digital Twin-Generated Dataset for Machine Learning-Based Detection of Benign and Malicious Heavy Hitter Flows [Dataset]. http://doi.org/10.5281/zenodo.14134646
    Explore at:
    binAvailable download formats
    Dataset updated
    Nov 13, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Amit Karamchandani Batra; Amit Karamchandani Batra; Javier Nuñez Fuente; Luis de la Cal García; Luis de la Cal García; Yenny Moreno Meneses; Alberto Mozo Velasco; Alberto Mozo Velasco; Antonio Pastor Perales; Antonio Pastor Perales; Diego R. López; Diego R. López; Javier Nuñez Fuente; Yenny Moreno Meneses
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 11, 2024
    Description

    The dataset used in this study is publicly available for research purposes. If you are using this dataset, please cite the following paper, which outlines the complete details of the dataset and the methodology used for its generation:

    Amit Karamchandani, Javier Núñez, Luis de-la-Cal, Yenny Moreno, Alberto Mozo, Antonio Pastor, "On the Applicability of Network Digital Twins in Generating Synthetic Data for Heavy Hitter Discrimination," under submission.

    This dataset contains a synthetic dataset generated to differentiate between benign and malicious heavy hitter flows within complex network environments. Heavy Hitter flows, which include high-volume data transfers, can significantly impact network performance, leading to congestion and degraded quality of service. Distinguishing legitimate heavy hitter activity from malicious Distributed Denial-of-Service traffic is critical for network management and security, yet existing datasets lack the granularity needed for training machine learning models to effectively make this distinction.

    To address this, a Network Digital Twin (NDT) approach was utilized to emulate realistic network conditions and traffic patterns, enabling automated generation of labeled data for both benign and malicious HH flows alongside regular traffic.

    The feature set includes flow statistics commonly used in network analysis, such as:

    • Traffic protocol type,
    • Flow duration (the time between the initial and final packet in both directions),
    • Total count of payload packets transmitted in both directions,
    • Cumulative bytes transmitted in both directions,
    • Time discrepancy between the first packet observations at the source and destination,
    • Packet and byte transmission rates per second within each interval, and
    • Total packet and byte counts within each interval in both directions.
  14. n

    Data from: Assessing predictive performance of supervised machine learning...

    • data.niaid.nih.gov
    • datasetcatalog.nlm.nih.gov
    • +1more
    zip
    Updated May 23, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Evans Omondi (2023). Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model [Dataset]. http://doi.org/10.5061/dryad.wh70rxwrh
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 23, 2023
    Dataset provided by
    Strathmore University
    Authors
    Evans Omondi
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, the study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work including Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metrics values and analysis, it was discovered that eXtreme Gradient Boosting was the most optimal algorithm in both classification and regression, with a R2 score of 97.45% and an Accuracy value of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen. Methods Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets. In a web-based data-science environment, they can study datasets and construct models.

  15. TREC 2022 Deep Learning test collection

    • data.nist.gov
    • gimi9.com
    • +1more
    Updated Mar 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ian Soboroff (2023). TREC 2022 Deep Learning test collection [Dataset]. http://doi.org/10.18434/mds2-2974
    Explore at:
    Dataset updated
    Mar 1, 2023
    Dataset provided by
    National Institute of Standards and Technologyhttp://www.nist.gov/
    Authors
    Ian Soboroff
    License

    https://www.nist.gov/open/licensehttps://www.nist.gov/open/license

    Description

    This is a test collection for passage and document retrieval, produced in the TREC 2023 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision). Certain machine learning based methods, such as methods based on deep learning are known to require very large datasets for training. Lack of such large scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in the previous years aimed at providing large scale datasets to TREC, and create a focused research effort with a rigorous blind evaluation of ranker for the passage ranking and document ranking tasks. Similar to the previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought in to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision? The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.

  16. N

    North America Machine Learning (ML) Market Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Oct 19, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Market Research Forecast (2025). North America Machine Learning (ML) Market Report [Dataset]. https://www.marketresearchforecast.com/reports/north-america-machine-learning-ml-market-1896
    Explore at:
    doc, pdf, pptAvailable download formats
    Dataset updated
    Oct 19, 2025
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policyhttps://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    North America
    Variables measured
    Market Size
    Description

    The North America Machine Learning (ML) Market size was valued at USD 19.20 USD billion in 2023 and is projected to reach USD 172.15 USD billion by 2032, exhibiting a CAGR of 36.8 % during the forecast period. The increase in demand for efficient data analytics solutions, the growth of cloud computing, and the proliferation of IoT devices are driving the market's growth. Machine learning (ML) is a discipline of artificial intelligence that provides machines with the ability to automatically learn from data and past experiences while identifying patterns to make predictions with minimal human intervention. Machine learning methods enable computers to operate autonomously without explicit programming. ML applications are fed with new data, and they can independently learn, grow, develop, and adapt. Machine learning derives insightful information from large volumes of data by leveraging algorithms to identify patterns and learn in an iterative process. ML algorithms use computation methods to learn directly from data instead of relying on any predetermined equation that may serve as a model. Machine learning is used today for a wide range of commercial purposes, including suggesting products to consumers based on their past purchases, predicting stock market fluctuations, and translating text from one language to another. The North America Machine Learning (ML) Market is primarily driven by the increasing adoption of essential services like security information and cloud applications. Key drivers for this market are: Growing Adoption of Mobile Commerce to Augment the Demand for Virtual Fitting Room Tool . Potential restraints include: Lack of Privacy and Privacy Violations in AI and ML Applications to Restrain Market Growth. Notable trends are: Growing Implementation of Touch-based and Voice-based Infotainment Systems to Increase Adoption of Intelligent Cars.

  17. d

    M-ART | Video Data | Global | 100,000 Stock videos | Including metadata and...

    • datarade.ai
    Updated Sep 11, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    M-ART (2025). M-ART | Video Data | Global | 100,000 Stock videos | Including metadata and releases | Dataset for AI & ML [Dataset]. https://datarade.ai/data-products/m-art-video-data-global-100-000-stock-videos-includin-m-art
    Explore at:
    .csv, .jpeg, .mp4, .movAvailable download formats
    Dataset updated
    Sep 11, 2025
    Dataset authored and provided by
    M-ART
    Area covered
    Tunisia, Benin, Chad, Paraguay, Curaçao, Bangladesh, Saint Helena, El Salvador, Estonia, Andorra
    Description

    "Collection of 100,000 high-quality video clips across diverse real-world domains, designed to accelerate the training and optimization of computer vision and multimodal AI models."

    Overview This dataset contains 100,000 proprietary and partner-produced video clips filmed in 4K/6K with cinema-grade RED cameras. Each clip is commercially cleared with full releases, structured metadata, and available in RAW or MOV/MP4 formats. The collection spans a wide variety of domains — people and lifestyle, healthcare and medical, food and cooking, office and business, sports and fitness, nature and landscapes, education, and more. This breadth ensures robust training data for computer vision, multimodal, and machine learning projects.

    The data set All 100,000 videos have been reviewed for quality and compliance. The dataset is optimized for AI model training, supporting use cases from face and activity recognition to scene understanding and generative AI. Custom datasets can also be produced on demand, enabling clients to close data gaps with tailored, high-quality content.

    About M-ART M-ART is a leading provider of cinematic-grade datasets for AI training. With extensive expertise in large-scale content production and curation, M-ART delivers both ready-to-use video datasets and fully customized collections. All data is proprietary, rights-cleared, and designed to help global AI leaders accelerate research, development, and deployment of next-generation models.

  18. d

    Factori AI & ML Training Data | People Data | USA | Machine Learning Data

    • datarade.ai
    .json, .csv
    Updated Jul 23, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Factori (2022). Factori AI & ML Training Data | People Data | USA | Machine Learning Data [Dataset]. https://datarade.ai/data-products/factori-ai-ml-training-data-consumer-data-usa-machine-factori
    Explore at:
    .json, .csvAvailable download formats
    Dataset updated
    Jul 23, 2022
    Dataset authored and provided by
    Factori
    Area covered
    United States of America
    Description

    Our People data is gathered and aggregated via surveys, digital services, and public data sources. We use powerful profiling algorithms to collect and ingest only fresh and reliable data points.

    Our comprehensive data enrichment solution includes a variety of data sets that can help you address gaps in your customer data, gain a deeper understanding of your customers, and power superior client experiences.

    1. Geography - City, State, ZIP, County, CBSA, Census Tract, etc.
    2. Demographics - Gender, Age Group, Marital Status, Language etc.
    3. Financial - Income Range, Credit Rating Range, Credit Type, Net worth Range, etc
    4. Persona - Consumer type, Communication preferences, Family type, etc
    5. Interests - Content, Brands, Shopping, Hobbies, Lifestyle etc.
    6. Household - Number of Children, Number of Adults, IP Address, etc.
    7. Behaviours - Brand Affinity, App Usage, Web Browsing etc.
    8. Firmographics - Industry, Company, Occupation, Revenue, etc
    9. Retail Purchase - Store, Category, Brand, SKU, Quantity, Price etc.
    10. Auto - Car Make, Model, Type, Year, etc.
    11. Housing - Home type, Home value, Renter/Owner, Year Built etc.

    People Data Schema & Reach: Our data reach represents the total number of counts available within various categories and comprises attributes such as country location, MAU, DAU & Monthly Location Pings:

    Data Export Methodology: Since we collect data dynamically, we provide the most updated data and insights via a best-suited method on a suitable interval (daily/weekly/monthly).

    People data Use Cases:

    360-Degree Customer View: Get a comprehensive image of customers by the means of internal and external data aggregation. Data Enrichment: Leverage Online to offline consumer profiles to build holistic audience segments to improve campaign targeting using user data enrichment Fraud Detection: Use multiple digital (web and mobile) identities to verify real users and detect anomalies or fraudulent activity. Advertising & Marketing: Understand audience demographics, interests, lifestyle, hobbies, and behaviors to build targeted marketing campaigns.

    Here's the schema of People Data: person_id first_name last_name age gender linkedin_url twitter_url facebook_url city state address zip zip4 country delivery_point_bar_code carrier_route walk_seuqence_code fips_state_code fips_country_code country_name latitude longtiude address_type metropolitan_statistical_area core_based+statistical_area census_tract census_block_group census_block primary_address pre_address streer post_address address_suffix address_secondline address_abrev census_median_home_value home_market_value property_build+year property_with_ac property_with_pool property_with_water property_with_sewer general_home_value property_fuel_type year month household_id Census_median_household_income household_size marital_status length+of_residence number_of_kids pre_school_kids single_parents working_women_in_house_hold homeowner children adults generations net_worth education_level occupation education_history credit_lines credit_card_user newly_issued_credit_card_user credit_range_new
    credit_cards loan_to_value mortgage_loan2_amount mortgage_loan_type
    mortgage_loan2_type mortgage_lender_code
    mortgage_loan2_render_code
    mortgage_lender mortgage_loan2_lender
    mortgage_loan2_ratetype mortgage_rate
    mortgage_loan2_rate donor investor interest buyer hobby personal_email work_email devices phone employee_title employee_department employee_job_function skills recent_job_change company_id company_name company_description technologies_used office_address office_city office_country office_state office_zip5 office_zip4 office_carrier_route office_latitude office_longitude office_cbsa_code
    office_census_block_group
    office_census_tract office_county_code
    company_phone
    company_credit_score
    company_csa_code
    company_dpbc
    company_franchiseflag
    company_facebookurl company_linkedinurl company_twitterurl
    company_website company_fortune_rank
    company_government_type company_headquarters_branch company_home_business
    company_industry
    company_num_pcs_used
    company_num_employees
    company_firm_individual company_msa company_msa_name
    company_naics_code
    company_naics_description
    company_naics_code2 company_naics_description2
    company_sic_code2
    company_sic_code2_description
    company_sic_code4 company_sic_code4_description
    company_sic_code6
    company_sic_code6_description
    company_sic_code8
    company_sic_code8_description company_parent_company
    company_parent_company_location company_public_private company_subsidiary_company company_residential_business_code company_revenue_at_side_code company_revenue_range
    company_revenue company_sales_volume
    company_small_business company_stock_ticker company_year_founded company_minorityowned
    company_female_owned_or_operated company_franchise_code company_dma company_dma_name
    company_hq_address
    company_hq_city company_hq_duns company_hq_state
    company_hq_zip5 company_hq_zip4 company_se...

  19. Airoboros LLMs Math Dataset

    • kaggle.com
    zip
    Updated Nov 24, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Devastator (2023). Airoboros LLMs Math Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/airoboros-llms-math-dataset
    Explore at:
    zip(36964941 bytes)Available download formats
    Dataset updated
    Nov 24, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Airoboros LLMs Math Dataset

    Mastering Complex Mathematical Operations in Machine Learning

    By Huggingface Hub [source]

    About this dataset

    The Airoboros-3.1 dataset is the perfect tool to help machine learning models excel in the difficult realm of complicated mathematical operations. This data collection features thousands of conversations between machines and humans, formatted in ShareGPT to maximize optimization in an OS ecosystem. The dataset’s focus on advanced subjects like factorials, trigonometry, and larger numerical values will help drive machine learning models to the next level - facilitating critical acquisition of sophisticated mathematical skills that are essential for ML success. As AI technology advances at such a rapid pace, training neural networks to correspondingly move forward can be a daunting and complicated challenge - but with Airoboros-3.1’s powerful datasets designed around difficult mathematical operations it just became one step closer to achievable!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    To get started, download the dataset from Kaggle and use the train.csv file. This file contains over two thousand examples of conversations between ML models and humans which have been formatted using ShareGPT - fast and efficient OS ecosystem fine-tuning tools designed to help with understanding mathematical operations more easily. The file includes two columns: category and conversations, both of which are marked as strings in the data itself.

    Once you have downloaded the train file you can begin setting up your own ML training environment by using any of your preferred frameworks or methods. Your model should focus on predicting what kind of mathematical operations will likely be involved in future conversations by referring back to previous dialogues within this dataset for reference (category column). You can also create your own test sets from this data, adding new conversation topics either by modifying existing rows or creating new ones entirely with conversation topics related to mathematics. Finally, compare your model’s results against other established models or algorithms that are already published online!

    Happy training!

    Research Ideas

    • It can be used to build custom neural networks or machine learning algorithms that are specifically designed for complex mathematical operations.
    • This data set can be used to teach and debug more general-purpose machine learning models to recognize large numbers, and intricate calculations within natural language processing (NLP).
    • The Airoboros-3.1 dataset can also be utilized as a supervised learning task: models could learn from the conversations provided in the dataset how to respond correctly when presented with complex mathematical operations

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv | Column name | Description | |:------------------|:-----------------------------------------------------------------------------| | category | The type of mathematical operation being discussed. (String) | | conversations | The conversations between the machine learning model and the human. (String) |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.

  20. d

    Prediction data from: Machine learning predicts which rivers, streams, and...

    • datadryad.org
    • dataone.org
    • +1more
    zip
    Updated Dec 10, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Simon Greenhill; Hannah Druckenmiller; Sherrie Wang; David Keiser; Manuela Girotto; Jason Moore; Nobuhiro Yamaguchi; Alberto Todeschini; Joseph Shapiro (2023). Prediction data from: Machine learning predicts which rivers, streams, and wetlands the Clean Water Act regulates [Dataset]. http://doi.org/10.5061/dryad.z34tmpgm7
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Dryad
    Authors
    Simon Greenhill; Hannah Druckenmiller; Sherrie Wang; David Keiser; Manuela Girotto; Jason Moore; Nobuhiro Yamaguchi; Alberto Todeschini; Joseph Shapiro
    Time period covered
    Sep 27, 2023
    Description

    This dataset contains model outputs that were analyzed to produce the main results of the paper.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
Organization logo

Machine learning algorithm validation with a limited sample size

Explore at:
text/x-pythonAvailable download formats
Dataset updated
May 30, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high dimensional datasets, which commonly have a small number of samples because of the intrinsic high cost of data collection involving human participants. High dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work, however it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection if performed on pooled training and testing data is contributing to bias considerably more than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.

Search
Clear search
Close search
Google apps
Main menu