100+ datasets found
  1. Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess the status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered in developing V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and to apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class).
Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
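
    The selection and split procedure described above can be sketched roughly as follows. This is an illustration only, not the NABat pipeline: it assumes the 1250-file cap applies per species/grid cell combination, and the split fractions (80/10/10), the function name cap_and_split, and the record layout are all hypothetical, since the description does not state them.

```python
import random
from collections import defaultdict


def cap_and_split(records, cap=1250, fractions=(0.8, 0.1, 0.1), seed=0):
    """Randomly cap files per (species, grid cell), then split at random.

    records: iterable of (file_name, species_code, grid_cell) tuples.
    cap and fractions are assumptions made for this sketch.
    Returns three lists of (file_name, species_code): train, val, test.
    """
    rng = random.Random(seed)

    # Group file names by species/grid cell combination.
    groups = defaultdict(list)
    for name, species, cell in records:
        groups[(species, cell)].append(name)

    # Draw files at random from each group until it is exhausted
    # or the cap is reached.
    selected = []
    for key in sorted(groups):
        files = list(groups[key])
        rng.shuffle(files)
        selected.extend((name, key[0]) for name in files[:cap])

    # Random split into training, validation, and test (holdout) sets.
    rng.shuffle(selected)
    n_train = int(len(selected) * fractions[0])
    n_val = int(len(selected) * fractions[1])
    train = selected[:n_train]
    val = selected[n_train:n_train + n_val]
    test = selected[n_train + n_val:]
    return train, val, test
```

    With a small cap, a group holding more files than the cap contributes exactly cap files, while a smaller group contributes all of its files.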

  2. BUTTER - Empirical Deep Learning Dataset

    • datasets.ai
    • data.openei.org
    • +2more
    21, 28
    Updated Sep 11, 2024
    Cite
    Department of Energy (2024). BUTTER - Empirical Deep Learning Dataset [Dataset]. https://datasets.ai/datasets/butter-empirical-deep-learning-dataset
    Explore at:
    Available download formats: 28, 21
    Dataset updated
    Sep 11, 2024
    Dataset authored and provided by
    Department of Energy
    Description

    The BUTTER Empirical Deep Learning Dataset represents an empirical study of deep learning phenomena on dense fully connected networks, scanning across thirteen datasets, eight network shapes, fourteen depths, twenty-three network sizes (number of trainable parameters), four learning rates, six minibatch sizes, four levels of label noise, and fourteen levels of L1 and L2 regularization each. Multiple repetitions (typically 30, sometimes 10) of each combination of hyperparameters were performed, and statistics including training and test loss (using an 80% / 20% shuffled train-test split) are recorded at the end of each training epoch. In total, this dataset covers 178 thousand distinct hyperparameter settings ("experiments"), 3.55 million individual training runs (an average of 20 repetitions of each experiment), and a total of 13.3 billion training epochs (three thousand epochs were covered by most runs). Accumulating this dataset consumed 5,448.4 CPU core-years, 17.8 GPU-years, and 111.2 node-years.
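
    As a quick sanity check on the totals quoted above, the rounded figures can be divided against each other. This is plain arithmetic on the numbers in the description, not a calculation taken from the dataset itself; the variable names are illustrative.

```python
# Rounded totals as quoted in the description.
experiments = 178_000          # distinct hyperparameter settings
runs = 3_550_000               # individual training runs
epochs = 13_300_000_000        # total training epochs

# Average repetitions per experiment: about 20, matching the text.
reps_per_experiment = runs / experiments

# Average epochs per run: a bit above the "three thousand" that most
# runs covered, so some runs presumably trained for longer.
epochs_per_run = epochs / runs

print(f"{reps_per_experiment:.1f} repetitions per experiment")
print(f"{epochs_per_run:.0f} epochs per run on average")
```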

  3. TREC 2022 Deep Learning test collection

    • data.nist.gov
    • s.cnmilf.com
    • +1more
    Updated Mar 1, 2023
    Cite
    Ian Soboroff (2023). TREC 2022 Deep Learning test collection [Dataset]. http://doi.org/10.18434/mds2-2974
    Explore at:
    Dataset updated
    Mar 1, 2023
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Ian Soboroff
    License

    https://www.nist.gov/open/license

    Description

    This is a test collection for passage and document retrieval, produced in the TREC 2022 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision). Certain machine learning based methods, such as methods based on deep learning, are known to require very large datasets for training. The lack of such large-scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in previous years aimed at providing large-scale datasets to TREC and creating a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks. As in previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision? The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
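
    The description mentions evaluation based on early precision against relevance judgments. A minimal sketch of one such measure, precision at rank k, is shown below. It is not the track's official evaluation code, and the function name and document identifiers are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are judged relevant.

    ranked_ids: document IDs in the order a ranker returned them.
    relevant_ids: set of IDs judged relevant for the query.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Divide by k (not len(top_k)) so short result lists are penalized.
    return hits / k


# Hypothetical ranking and judgments for a single query.
ranking = ["d3", "d1", "d7", "d2"]
judged_relevant = {"d1", "d2"}
print(precision_at_k(ranking, judged_relevant, 2))
```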

  4. Artificial Intelligence (AI) Training Dataset Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Jun 30, 2025
    Cite
    Growth Market Reports (2025). Artificial Intelligence (AI) Training Dataset Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/artificial-intelligence-training-dataset-market-global-industry-analysis
    Explore at:
    Available download formats: pptx, csv, pdf
    Dataset updated
    Jun 30, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Artificial Intelligence (AI) Training Dataset Market Outlook



    According to our latest research, the global Artificial Intelligence (AI) Training Dataset market size reached USD 3.15 billion in 2024, reflecting robust industry momentum. The market is expanding at a notable CAGR of 20.8% and is forecasted to attain USD 20.92 billion by 2033. This impressive growth is primarily attributed to the surging demand for high-quality, annotated datasets to fuel machine learning and deep learning models across diverse industry verticals. The proliferation of AI-driven applications, coupled with rapid advancements in data labeling technologies, is further accelerating the adoption and expansion of the AI training dataset market globally.




    One of the most significant growth factors propelling the AI training dataset market is the exponential rise in data-driven AI applications across industries such as healthcare, automotive, retail, and finance. As organizations increasingly rely on AI-powered solutions for automation, predictive analytics, and personalized customer experiences, the need for large, diverse, and accurately labeled datasets has become critical. Enhanced data annotation techniques, including manual, semi-automated, and fully automated methods, are enabling organizations to generate high-quality datasets at scale, which is essential for training sophisticated AI models. The integration of AI in edge devices, smart sensors, and IoT platforms is further amplifying the demand for specialized datasets tailored for unique use cases, thereby fueling market growth.




    Another key driver is the ongoing innovation in machine learning and deep learning algorithms, which require vast and varied training data to achieve optimal performance. The increasing complexity of AI models, especially in areas such as computer vision, natural language processing, and autonomous systems, necessitates the availability of comprehensive datasets that accurately represent real-world scenarios. Companies are investing heavily in data collection, annotation, and curation services to ensure their AI solutions can generalize effectively and deliver reliable outcomes. Additionally, the rise of synthetic data generation and data augmentation techniques is helping address challenges related to data scarcity, privacy, and bias, further supporting the expansion of the AI training dataset market.




    The market is also benefiting from the growing emphasis on ethical AI and regulatory compliance, particularly in data-sensitive sectors like healthcare, finance, and government. Organizations are prioritizing the use of high-quality, unbiased, and diverse datasets to mitigate algorithmic bias and ensure transparency in AI decision-making processes. This focus on responsible AI development is driving demand for curated datasets that adhere to strict quality and privacy standards. Moreover, the emergence of data marketplaces and collaborative data-sharing initiatives is making it easier for organizations to access and exchange valuable training data, fostering innovation and accelerating AI adoption across multiple domains.




    From a regional perspective, North America currently dominates the AI training dataset market, accounting for the largest revenue share in 2024, driven by significant investments in AI research, a mature technology ecosystem, and the presence of leading AI companies and data annotation service providers. Europe and Asia Pacific are also witnessing rapid growth, with increasing government support for AI initiatives, expanding digital infrastructure, and a rising number of AI startups. While North America sets the pace in terms of technological innovation, Asia Pacific is expected to exhibit the highest CAGR during the forecast period, fueled by the digital transformation of emerging economies and the proliferation of AI applications across various industry sectors.





    Data Type Analysis



    The AI training dataset market is segmented by data type into Text, Image/Video, Audio, and Others, each playing a crucial role in powering different AI applications. Text da

  5. LOCBEEF: Beef Quality Image dataset for Deep Learning Models

    • data.mendeley.com
    Updated Nov 30, 2022
    + more versions
    Cite
    Tri Mulya Dharma (2022). LOCBEEF: Beef Quality Image dataset for Deep Learning Models [Dataset]. http://doi.org/10.17632/nhs6mjg6yy.1
    Explore at:
    Dataset updated
    Nov 30, 2022
    Authors
    Tri Mulya Dharma
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The LOCBEEF dataset contains 3268 images of local Aceh beef collected between 07:00 and 22:00; more information about the collection times is shown in Fig. The dataset contains two top-level directories, train and test, and each of these has two subdirectories, fresh and rotten. Examples of the images can be seen in Figs. 2 and 3. The directory structure for the data is shown in Fig. 1. The train directory contains 2228 images, with 1114 images in each subdirectory, and the test directory contains 980 images, with 490 images in each subdirectory. The images have resolutions of 176 x 144, 320 x 240, 640 x 480, 720 x 480, 720 x 720, 1280 x 720, 1920 x 1080, 2560 x 1920, 3120 x 3120, 3264 x 2248, and 4160 x 3120 pixels.

    The LOCBEEF dataset has been classified using Convolutional Neural Networks, a deep learning method, with an image composition of 70% training data and 30% test data. Images with the dimensions listed above are included in the LOCBEEF dataset for use with ResNet50.
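
    To check a local copy against the per-directory counts given above, the files in each split and class folder can be counted. The sketch below assumes the folders are named exactly train, test, fresh, and rotten and that the images use common extensions; the description confirms neither the casing nor the file types, and count_images is a hypothetical helper.

```python
from pathlib import Path

# Assumed image extensions; adjust to match the actual files.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images(root):
    """Count image files under root/<split>/<label> for each combination.

    Returns a dict keyed by (split, label); missing folders count as 0.
    """
    counts = {}
    for split in ("train", "test"):
        for label in ("fresh", "rotten"):
            folder = Path(root) / split / label
            if folder.is_dir():
                counts[(split, label)] = sum(
                    1
                    for path in folder.iterdir()
                    if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
                )
            else:
                counts[(split, label)] = 0
    return counts
```

    If the copy matches the description, each train folder should report 1114 images and each test folder 490.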

  6. Process-guided deep learning water temperature predictions: 6 Model...

    • catalog.data.gov
    • data.usgs.gov
    • +2more
    Updated Jun 15, 2024
    + more versions
    Cite
    Climate Adaptation Science Centers (2024). Process-guided deep learning water temperature predictions: 6 Model evaluation (test data and RMSE) [Dataset]. https://catalog.data.gov/dataset/process-guided-deep-learning-water-temperature-predictions-6-model-evaluation-test-data-an
    Explore at:
    Dataset updated
    Jun 15, 2024
    Dataset provided by
    Climate Adaptation Science Centers
    Description

    This dataset includes evaluation data ("test" data) and performance metrics for water temperature predictions from multiple modeling frameworks. Process-Based (PB) models were configured and calibrated with training data to reduce root-mean-squared error. Uncalibrated models used default configurations (PB0; see Winslow et al. 2016 for details) and no parameters were adjusted according to model fit with observations. Deep Learning (DL) models were Long Short-Term Memory artificial recurrent neural network models which used training data to adjust model structure and weights for temperature predictions (Jia et al. 2019). Process-Guided Deep Learning (PGDL) models were DL models with an added physical constraint for energy conservation as a loss term. These models were pre-trained with uncalibrated Process-Based model outputs (PB0) before training on actual temperature observations. Performance was measured as root-mean-squared error relative to temperature observations during the test period. Test data include compiled water temperature data from a variety of sources, including the Water Quality Portal (Read et al. 2017), the North Temperate Lakes Long-Term Ecological Research Program (https://lter.limnology.wisc.edu/), the Minnesota Department of Natural Resources, and the Global Lake Ecological Observatory Network (gleon.org). This dataset is part of a larger data release of lake temperature model inputs and outputs for 68 lakes in the U.S. states of Minnesota and Wisconsin (http://dx.doi.org/10.5066/P9AQPIVD).
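
    The performance metric named above, root-mean-squared error against observed temperatures, can be computed as in the sketch below. The function name and sample values are illustrative and are not taken from the data release.

```python
import math


def rmse(predicted, observed):
    """Root-mean-squared error between paired predictions and observations."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical water temperatures in degrees Celsius.
model_output = [12.1, 14.8, 17.3]
observations = [12.0, 15.0, 18.0]
print(rmse(model_output, observations))
```

    A lower value means the predictions sit closer to the observations, which is how the PB, DL, and PGDL frameworks are compared here.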

  7. AI Training Dataset Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). AI Training Dataset Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-ai-training-dataset-market
    Explore at:
    Available download formats: csv, pptx, pdf
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    AI Training Dataset Market Outlook



    The global AI training dataset market size was valued at approximately USD 1.2 billion in 2023 and is projected to reach USD 6.5 billion by 2032, growing at a compound annual growth rate (CAGR) of 20.5% from 2024 to 2032. This substantial growth is driven by the increasing adoption of artificial intelligence across various industries, the necessity for large-scale and high-quality datasets to train AI models, and the ongoing advancements in AI and machine learning technologies.
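
    The relationship between the quoted market sizes and the growth rate can be checked with the standard compound annual growth rate formula. The sketch below assumes nine compounding years between the 2023 valuation and the 2032 projection, which the report does not state explicitly; the function names are illustrative.

```python
def project(value, rate, years):
    """Grow a starting value at a constant annual rate."""
    return value * (1 + rate) ** years


def implied_cagr(start, end, years):
    """Constant annual growth rate that takes start to end in `years`."""
    return (end / start) ** (1 / years) - 1


# Figures quoted in the report: USD 1.2 billion (2023), 20.5% CAGR,
# USD 6.5 billion (2032). Nine compounding years is an assumption.
projected_2032 = project(1.2, 0.205, 9)
rate_needed = implied_cagr(1.2, 6.5, 9)

print(f"projected 2032 size: USD {projected_2032:.2f} billion")
print(f"implied CAGR: {rate_needed:.1%}")
```

    Under that assumption the projection lands close to the quoted USD 6.5 billion, so the three figures are roughly consistent with one another.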



    One of the primary growth factors in the AI training dataset market is the exponential increase in data generation across multiple sectors. With the proliferation of internet usage, the expansion of IoT devices, and the digitalization of industries, there is an unprecedented volume of data being generated daily. This data is invaluable for training AI models, enabling them to learn and make more accurate predictions and decisions. Moreover, the need for diverse and comprehensive datasets to improve AI accuracy and reliability is further propelling market growth.



    Another significant factor driving the market is the rising investment in AI and machine learning by both public and private sectors. Governments around the world are recognizing the potential of AI to transform economies and improve public services, leading to increased funding for AI research and development. Simultaneously, private enterprises are investing heavily in AI technologies to gain a competitive edge, enhance operational efficiency, and innovate new products and services. These investments necessitate high-quality training datasets, thereby boosting the market.



    The proliferation of AI applications in various industries, such as healthcare, automotive, retail, and finance, is also a major contributor to the growth of the AI training dataset market. In healthcare, AI is being used for predictive analytics, personalized medicine, and diagnostic automation, all of which require extensive datasets for training. The automotive industry leverages AI for autonomous driving and vehicle safety systems, while the retail sector uses AI for personalized shopping experiences and inventory management. In finance, AI assists in fraud detection and risk management. The diverse applications across these sectors underline the critical need for robust AI training datasets.



    As the demand for AI applications continues to grow, the role of AI data resource services becomes increasingly vital. These services provide the necessary infrastructure and tools to manage, curate, and distribute datasets efficiently. By using them, organizations can ensure that their AI models are trained on high-quality and relevant data, which is crucial for achieving accurate and reliable outcomes. Such a service acts as a bridge between raw data and AI applications, streamlining the process of data acquisition, annotation, and validation. This not only enhances the performance of AI systems but also accelerates the development cycle, enabling faster deployment of AI-driven solutions across various sectors.



    Regionally, North America currently dominates the AI training dataset market due to the presence of major technology companies and extensive R&D activities in the region. However, Asia Pacific is expected to witness the highest growth rate during the forecast period, driven by rapid technological advancements, increasing investments in AI, and the growing adoption of AI technologies across various industries in countries like China, India, and Japan. Europe and Latin America are also anticipated to experience significant growth, supported by favorable government policies and the increasing use of AI in various sectors.



    Data Type Analysis



    The data type segment of the AI training dataset market encompasses text, image, audio, video, and others. Each data type plays a crucial role in training different types of AI models, and the demand for specific data types varies based on the application. Text data is extensively used in natural language processing (NLP) applications such as chatbots, sentiment analysis, and language translation. As the use of NLP is becoming more widespread, the demand for high-quality text datasets is continually rising. Companies are investing in curated text datasets that encompass diverse languages and dialects to improve the accuracy and efficiency of NLP models.



    Image data is critical for computer vision application

  8. FileMarket | 10,000 HQ Model Images from Multiple Angles for AI | LLM | ML |...

    • datarade.ai
    Updated Aug 18, 2024
    Cite
    FileMarket (2024). FileMarket | 10,000 HQ Model Images from Multiple Angles for AI | LLM | ML | DL Training Data [Dataset]. https://datarade.ai/data-products/filemarket-10-000-hq-model-images-from-multiple-angles-for-filemarket
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Aug 18, 2024
    Dataset authored and provided by
    FileMarket
    Area covered
    Croatia, Malaysia, Switzerland, Austria, Liechtenstein, Oman, Sri Lanka, Vietnam, Jordan, Romania
    Description

    Overview: FileMarket's dataset offers 10,000 high-resolution images of professional models, captured in a controlled studio environment by experienced photographers. Each image is expertly lit to ensure clarity and consistency across all photos, making this dataset an invaluable resource for various AI-driven applications.

    What Makes This Data Unique? This dataset stands out due to its meticulous attention to quality. Each model is photographed from multiple angles, providing a comprehensive view that is ideal for AI training. The diversity of models, encompassing various ethnicities, ages, and body types, ensures that the data is representative and inclusive. The consistency in lighting and background across all images reduces the need for additional preprocessing, making the data immediately usable for machine learning and deep learning projects.

    Data Sourcing: The images in this dataset were sourced exclusively from professional studio shoots. The controlled environment ensures that each image meets the highest standards, with consistent lighting, background, and quality. The photographers involved have extensive experience in fashion and commercial photography, guaranteeing that every image is of premium quality.

    Primary Use-Cases: This dataset is versatile and can be effectively used in several AI and machine learning contexts, including:

    • Object Detection Data: The clear and consistent images make this dataset ideal for training models in object detection, specifically in identifying human figures and facial features.
    • Machine Learning (ML) Data: The diversity and high quality of the images are perfect for feeding into machine learning algorithms, particularly those focused on human recognition and categorization.
    • Deep Learning (DL) Data: The multi-angle shots of models offer a rich dataset for deep learning models that require a variety of perspectives to improve accuracy, such as in 3D reconstruction and pose estimation.
    • Biometric Data: The detailed and varied images are suitable for training biometric systems, enhancing their ability to recognize and verify individuals across different conditions and contexts.

    Broader Data Offering: This dataset integrates seamlessly with other FileMarket offerings, allowing data buyers to combine it with other data types, such as text or video data, for more comprehensive AI training models. Whether for enhancing virtual try-on technologies for clothing and makeup or improving the accuracy of biometric systems, this dataset serves as a cornerstone in developing robust AI applications.

  9. FileMarket | 20,000 photos | AI Training Data | Large Language Model (LLM)...

    • datarade.ai
    Updated Nov 20, 2023
    Cite
    FileMarket (2023). FileMarket | 20,000 photos | AI Training Data | Large Language Model (LLM) Data | Machine Learning (ML) Data | Deep Learning (DL) Data | [Dataset]. https://datarade.ai/data-categories/deep-learning-dl-data/datasets
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Nov 20, 2023
    Dataset authored and provided by
    FileMarket
    Area covered
    Anguilla, Saint Kitts and Nevis, Bonaire, Central African Republic, China, Moldova (Republic of), Sweden, Saint Vincent and the Grenadines, Nauru, Greece
    Description

    FileMarket provides premium Large Language Model (LLM) Data designed to support and enhance a wide range of AI applications. Our globally sourced LLM Data sets are meticulously curated to ensure high quality, diversity, and accuracy, making them ideal for training robust and reliable language models. In addition to LLM Data, we also offer comprehensive datasets across Object Detection Data, Machine Learning (ML) Data, Deep Learning (DL) Data, and Biometric Data. Each dataset is carefully crafted to meet the specific needs of cutting-edge AI and machine learning projects.

    Key use cases of our Large Language Model (LLM) Data:

    • Text generation
    • Chatbots and virtual assistants
    • Machine translation
    • Sentiment analysis
    • Speech recognition
    • Content summarization

    Why choose FileMarket's data:

    • Object Detection Data: Essential for training AI in image and video analysis.
    • Machine Learning (ML) Data: Ideal for a broad spectrum of applications, from predictive analysis to NLP.
    • Deep Learning (DL) Data: Designed to support complex neural networks and deep learning models.
    • Biometric Data: Specialized for facial recognition, fingerprint analysis, and other biometric applications.

    FileMarket's premier sources for top-tier Large Language Model (LLM) Data and other specialized datasets ensure your AI projects drive innovation and achieve success across various applications.

  10. Dataset: An Open Combinatorial Diffraction Dataset Including Consensus Human...

    • data.nist.gov
    • s.cnmilf.com
    • +1more
    Updated Oct 23, 2020
    + more versions
    Cite
    Brian DeCost (2020). Dataset: An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models [Dataset]. http://doi.org/10.18434/mds2-2301
    Explore at:
    Dataset updated
    Oct 23, 2020
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Brian DeCost
    License

    https://www.nist.gov/open/license

    Description

    This record contains the open dataset, software, and other files accompanying the manuscript "An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models," submitted for publication to Integrating Materials and Manufacturing Innovation. Machine learning and autonomy are increasingly prevalent in materials science, but existing models are often trained or tuned using idealized data as absolute ground truths. In actual materials science, "ground truth" is often a matter of interpretation and is more readily determined by consensus. Here we present the data, software, and other files for a study using as-obtained diffraction data as a test case for evaluating the performance of machine learning models in the presence of differing expert opinions. We demonstrate that experts with similar backgrounds can disagree greatly even for something as intuitive as using diffraction to identify the start and end of a phase transformation. We then use a logarithmic likelihood method to evaluate the performance of machine learning models in relation to the consensus expert labels and their variance. We further illustrate this method's efficacy in ranking a number of state-of-the-art phase mapping algorithms. We propose a materials data challenge centered around the problem of evaluating models based on consensus with uncertainty. The data, labels, and code used in this study are all available online at data.gov, and the interested reader is encouraged to replicate and improve the existing models or to propose alternative methods for evaluating algorithmic performance.
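
    One way to picture scoring a model against consensus labels and their variance is a Gaussian log-likelihood, sketched below. The Gaussian form is an assumption made for illustration; the description does not specify the likelihood used in the study, and the function name and sample values are hypothetical.

```python
import math


def gaussian_log_likelihood(prediction, expert_mean, expert_std):
    """Log-density of a prediction under a normal distribution fitted to
    expert labels (mean and standard deviation)."""
    if expert_std <= 0:
        raise ValueError("expert_std must be positive")
    variance = expert_std ** 2
    return (
        -0.5 * math.log(2 * math.pi * variance)
        - (prediction - expert_mean) ** 2 / (2 * variance)
    )


# Hypothetical example: experts place a phase transformation start at
# composition index 40 on average, with a spread of 5.
close_model = gaussian_log_likelihood(42, expert_mean=40, expert_std=5)
distant_model = gaussian_log_likelihood(60, expert_mean=40, expert_std=5)
print(close_model > distant_model)
```

    A score of this kind penalizes a prediction less when the experts themselves disagree widely, which is the motivation the description gives for evaluating against consensus with uncertainty.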

  11. Dataset for the manuscript of Analysis on constructing the training data to...

    • ieee-dataport.org
    Updated Jun 20, 2024
    Cite
    Dianxin Luan (2024). Dataset for the manuscript of Analysis on constructing the training data to train neural networks for channel estimation [Dataset]. https://ieee-dataport.org/documents/dataset-manuscript-analysis-constructing-training-data-train-neural-networks-channel
    Explore at:
    Dataset updated
    Jun 20, 2024
    Authors
    Dianxin Luan
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    but its feasibility is challenged by the tremendous computational resources required.

  12. Process-guided deep learning water temperature predictions: 4 Training data

    • catalog.data.gov
    • data.usgs.gov
    • +3more
    Updated Jun 15, 2024
    + more versions
    Cite
    Climate Adaptation Science Centers (2024). Process-guided deep learning water temperature predictions: 4 Training data [Dataset]. https://catalog.data.gov/dataset/process-guided-deep-learning-water-temperature-predictions-4-training-data-dca4e
    Explore at:
    Dataset updated
    Jun 15, 2024
    Dataset provided by
    Climate Adaptation Science Centers
    Description

    This dataset includes compiled water temperature data from a variety of sources, including the Water Quality Portal (Read et al. 2017), the North Temperate Lakes Long-Term Ecological Research Program (https://lter.limnology.wisc.edu/), the Minnesota Department of Natural Resources, and the Global Lake Ecological Observatory Network (gleon.org). This dataset is part of a larger data release of lake temperature model inputs and outputs for 68 lakes in the U.S. states of Minnesota and Wisconsin (http://dx.doi.org/10.5066/P9AQPIVD).

  13. AI Training Dataset Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Apr 30, 2025
    Cite
    Data Insights Market (2025). AI Training Dataset Report [Dataset]. https://www.datainsightsmarket.com/reports/ai-training-dataset-1501897
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Apr 30, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The AI training dataset market is experiencing robust growth, driven by the increasing adoption of artificial intelligence across diverse sectors. The market's expansion is fueled by the urgent need for high-quality data to train sophisticated AI models capable of handling complex tasks. Key application areas, such as autonomous vehicles in the automotive industry, advanced medical diagnosis in healthcare, and personalized experiences in retail and e-commerce, are significantly contributing to this market's upward trajectory. The prevalence of text, image/video, and audio data types further diversifies the market, offering opportunities for specialized dataset providers. While the market faces challenges like data privacy concerns and the high cost of data annotation, the overall trajectory remains positive, with a projected Compound Annual Growth Rate (CAGR) exceeding 20% for the forecast period (2025-2033). This growth is further supported by advancements in deep learning techniques that demand increasingly larger and more diverse datasets for optimal performance. Leading companies like Google, Amazon, and Microsoft are actively investing in this space, expanding their dataset offerings and fostering competition within the market. Furthermore, the emergence of specialized data annotation providers caters to the specific needs of various industries, ensuring accurate and reliable data for AI model development. The geographic distribution of the market reveals strong presence in North America and Europe, driven by early adoption of AI technologies and the presence of major technology players. However, Asia Pacific is projected to witness significant growth in the coming years, propelled by increasing digitalization and a burgeoning AI ecosystem in countries like China and India. Government initiatives promoting AI development in various regions are also expected to stimulate demand for high-quality training datasets. 
While challenges related to data security and ethical considerations remain, the long-term outlook for the AI training dataset market is exceptionally promising, fueled by the continued evolution of artificial intelligence and its increasing integration into various aspects of modern life. The market segmentation by application and data type allows for granular analysis and targeted investments for businesses operating in this rapidly expanding sector.

  14. Machine Learning Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated Jul 5, 2025
    Cite
    Bright Data (2025). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
    Explore at:
    Available download formats: .json, .csv, .xlsx
    Dataset updated
    Jul 5, 2025
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.

  15. U.S. AI Training Dataset Market Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated May 19, 2025
    Cite
    Archive Market Research (2025). U.S. AI Training Dataset Market Report [Dataset]. https://www.archivemarketresearch.com/reports/us-ai-training-dataset-market-4957
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    May 19, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    United States
    Variables measured
    Market Size
    Description

    The U.S. AI Training Dataset Market size was valued at USD 590.4 million in 2023 and is projected to reach USD 1880.70 million by 2032, exhibiting a CAGR of 18.0% during the forecast period. The U.S. AI training dataset market covers the generation, selection, and organization of datasets used to train artificial intelligence. These datasets contain the information that machine learning algorithms need to learn and make inferences. Uses include the advancement and improvement of AI solutions in fields such as transport, medical analysis, language processing, and financial measurement. Applications include training models for tasks such as image classification, predictive modeling, and natural language interfaces. Emerging trends include a shift toward more, better-quality, more varied, and annotated data to improve model efficiency; synthetic data generation to address data shortages; and attention to data confidentiality and ethical issues in dataset management. Furthermore, advances in artificial intelligence and machine learning are driving noticeable development in how datasets are built and used. Recent developments include: In February 2024, Google struck a deal worth USD 60 million per year with Reddit that will give the former real-time access to the latter’s data and use Google AI to enhance Reddit’s search capabilities. In February 2024, Microsoft announced around USD 2.1 billion investment in Mistral AI to expedite the growth and deployment of large language models. The U.S. giant is expected to underpin Mistral AI with Azure AI supercomputing infrastructure to provide top-notch scale and performance for AI training and inference workloads.

  16. power tower training dataset for deep learning model - Dataset -...

    • energydata.info
    Updated Aug 30, 2024
    Cite
    (2024). power tower training dataset for deep learning model - Dataset - ENERGYDATA.INFO [Dataset]. https://energydata.info/dataset/power-tower-training-dataset-for-deep-learning-model
    Explore at:
    Dataset updated
    Aug 30, 2024
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This is the training dataset for power tower deep learning model development. The dataset contains 50 cm resolution Mapbox image tiles (Maxar imagery) as well as the locations of the power towers present in the imagery, provided as a GeoJSON file. Both the geographic coordinates and the pixel coordinates of the power towers are included. The dataset covers pilot areas in west coast of Liberia, Yemen and India.

  17. Single Layer Perceptron Dataset(Small)

    • kaggle.com
    Updated Apr 19, 2023
    Cite
    ABIR HASAN 1703100 (2023). Single Layer Perceptron Dataset(Small) [Dataset]. http://doi.org/10.34740/kaggle/ds/3154953
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 19, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ABIR HASAN 1703100
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    We have chosen a simple NumPy array to implement the single-layer perceptron algorithm. We consider a total of 13 samples with three features and one class label. The class label is binary, 0 or 1. The training dataset contains eight data samples, while the validation dataset contains five. [Fig 1.1: Train Data] [Fig 1.2: Test Data] Here the first value of every sample is set to 1, as the algorithm says the value of x0 should always be 1. But even without this characteristic, our code will give the correct output.
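
    A minimal sketch of the kind of single-layer perceptron this entry describes, assuming a step activation and the classic perceptron update rule. The arrays below are hypothetical toy values, not the data shown in the dataset's figures; they only mirror the stated shape (eight training samples, five validation samples, three features with x0 fixed at 1).

```python
import numpy as np

# Hypothetical toy data (NOT the values from the dataset's figures).
# Column 0 is the constant input x0 = 1; columns 1 and 2 are made-up inputs.
X_train = np.array([
    [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1],
    [1, 2, 0], [1, 0, 2], [1, 2, 2], [1, 2, 1],
], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1, 1, 1])  # label is 1 when x1 + x2 >= 2

X_val = np.array([
    [1, 0, 0.5], [1, 3, 0], [1, 0.5, 0], [1, 1, 2], [1, 3, 3],
], dtype=float)
y_val = np.array([0, 1, 0, 1, 1])


def predict(weights, X):
    """Step activation: output 1 where the weighted sum is positive, else 0."""
    return (X @ weights > 0).astype(int)


def train_perceptron(X, y, learning_rate=1.0, max_epochs=1000):
    """Classic perceptron rule: w += lr * (target - prediction) * x."""
    weights = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            update = learning_rate * (y_i - predict(weights, x_i))
            if update != 0:
                weights += update * x_i
                errors += 1
        if errors == 0:  # a full pass with no mistakes: training has converged
            break
    return weights


weights = train_perceptron(X_train, y_train)
train_accuracy = float(np.mean(predict(weights, X_train) == y_train))
val_accuracy = float(np.mean(predict(weights, X_val) == y_val))
print("weights:", weights)
print("train accuracy:", train_accuracy, "validation accuracy:", val_accuracy)
```

    Because the toy labels are linearly separable, the update rule stops making mistakes on the training samples after a finite number of passes; the validation accuracy depends on the learned boundary and is only printed here.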

  18. CSIRO Sentinel-1 SAR image dataset of oil- and non-oil features for machine...

    • data.csiro.au
    • researchdata.edu.au
    Updated Dec 15, 2022
    Cite
    David Blondeau-Patissier; Thomas Schroeder; Foivos Diakogiannis; Zhibin Li (2022). CSIRO Sentinel-1 SAR image dataset of oil- and non-oil features for machine learning ( Deep Learning ) [Dataset]. http://doi.org/10.25919/4v55-dn16
    Explore at:
    Dataset updated
    Dec 15, 2022
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    David Blondeau-Patissier; Thomas Schroeder; Foivos Diakogiannis; Zhibin Li
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Time period covered
    May 1, 2015 - Aug 31, 2022
    Area covered
    Dataset funded by
    CSIRO (http://www.csiro.au/)
    ESA
    Description

    What this collection is: A curated, binary-classified dataset of grayscale (1-band) images of 400 x 400 pixels, or image chips, in JPEG format. The chips were extracted from processed Sentinel-1 Synthetic Aperture Radar (SAR) satellite scenes acquired over various regions of the world, and feature clear open ocean, look-alikes (wind or biogenic features) and oil slicks.

    This binary dataset contains chips labelled as:
    - "0" for chips not containing any oil features (look-alikes or clean seas)
    - "1" for those containing oil features.

    This binary dataset is imbalanced, and biased towards "0" labelled chips (i.e., no oil features), which correspond to 66% of the dataset. Chips containing oil features, labelled "1", correspond to 34% of the dataset.

    Why: This dataset can be used for training, validation and/or testing of machine learning algorithms, including deep learning algorithms, for the detection of oil features in SAR imagery. It is directly applicable to algorithm development for the European Space Agency Sentinel-1 SAR mission (https://sentinel.esa.int/web/sentinel/missions/sentinel-1), and it may also be suitable for the development of detection algorithms for other SAR satellite sensors.

    Overview of this dataset: Total number of chips (both classes) is N=5,630.

    Class    0        1
    Total    3,725    1,905
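
    The class shares quoted above follow directly from the chip counts. As a small worked example, the sketch below recomputes them and also derives inverse-frequency class weights, one common way to compensate for imbalance during training. The weighting scheme is an assumption for illustration, not something the dataset authors prescribe.

```python
# Chip counts per class, as stated in the dataset overview.
class_counts = {0: 3725, 1: 1905}

total_chips = sum(class_counts.values())

# Share of each class in the whole dataset.
class_shares = {label: count / total_chips for label, count in class_counts.items()}

# Inverse-frequency weights (assumed scheme): total / (n_classes * count).
# The rarer class gets the larger weight.
n_classes = len(class_counts)
class_weights = {
    label: total_chips / (n_classes * count)
    for label, count in class_counts.items()
}

for label in sorted(class_counts):
    print(
        f"class {label}: {class_counts[label]} chips, "
        f"{class_shares[label]:.1%} of dataset, weight {class_weights[label]:.3f}"
    )
```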

    Further information and description is found in the ReadMe file provided (ReadMe_Sentinel1_SAR_OilNoOil_20221215.txt)

  19. Training datasets for AIMNet2 machine-learned neural network potential

    • kilthub.cmu.edu
    txt
    Updated Jan 27, 2025
    Cite
    Roman Zubatiuk; Olexandr Isayev; Dylan Anstine (2025). Training datasets for AIMNet2 machine-learned neural network potential [Dataset]. http://doi.org/10.1184/R1/27629937.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 27, 2025
    Dataset provided by
    Carnegie Mellon University
    Authors
    Roman Zubatiuk; Olexandr Isayev; Dylan Anstine
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The datasets contain molecular structures and the properties computed with B97-3c (GGA DFT) or wB97M-def2-TZVPP (range-separated hybrid DFT) methods. Each data file contains about 20M structures. DFT calculations were performed with ORCA 5.0.3 software. Properties include energy, forces, atomic charges, and molecular dipole and quadrupole moments.

  20. Encrypted Traffic Feature Dataset for Machine Learning and Deep Learning...

    • data.mendeley.com
    Updated Dec 6, 2022
    Cite
    Zihao Wang (2022). Encrypted Traffic Feature Dataset for Machine Learning and Deep Learning based Encrypted Traffic Analysis [Dataset]. http://doi.org/10.17632/xw7r4tt54g.1
    Explore at:
    Dataset updated
    Dec 6, 2022
    Authors
    Zihao Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This traffic dataset contains balanced amounts of encrypted malicious and legitimate traffic for encrypted malicious traffic detection and analysis. The dataset is secondary CSV feature data composed from six public traffic datasets.

    Our dataset is curated based on two criteria: The first criterion is to combine widely considered public datasets from existing works that contain enough encrypted malicious or encrypted legitimate traffic, such as the Malware Capture Facility Project datasets. The second criterion is to ensure that the final dataset is balanced between encrypted malicious and legitimate network traffic.

    Based on these criteria, 6 public datasets were selected. After data pre-processing, details of each selected public dataset and the size of the different types of encrypted traffic are shown in the “Dataset Statistic Analysis Document”. The document summarizes the malicious and legitimate traffic size we selected from each public dataset, the traffic size of each malicious traffic type, and the total traffic size of the composed dataset. From the table, we are able to observe that encrypted malicious and legitimate traffic each contribute approximately 50% of the final composed dataset.

    The datasets now made available were prepared for encrypted malicious traffic detection. Since the dataset is used for machine learning or deep learning model training, sample train and test sets are also provided. The train and test datasets are split at a ratio of 1:4. Such datasets can be used for machine learning or deep learning model training and testing based on selected features or after further data pre-processing.

Cite
U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0

Training dataset for NABat Machine Learning V1.0

Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
U.S. Geological Survey
Description

Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class).
Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
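
The selection and split described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the cap of 1,250 is applied here per species/grid cell combination (one reading of the description), the 80/10/10 split fractions are hypothetical because the source does not give them, and the record fields, species codes, and file names are made up for the demo.

```python
import random
from collections import defaultdict


def cap_per_group(records, cap=1250, seed=0):
    """Randomly keep at most `cap` files per (species, grid_cell) group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[(record["species"], record["grid_cell"])].append(record)
    selected = []
    for key in sorted(groups):
        files = groups[key]
        if len(files) > cap:
            files = rng.sample(files, cap)  # group exceeds the cap: subsample
        selected.extend(files)              # otherwise the group is exhausted
    return selected


def random_split(items, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle, then cut into training, validation, and test (holdout) sets.

    The fractions are assumed values, not taken from the data release.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# Hypothetical demo records: one large group and one small group.
records = [
    {"species": "MYLU", "grid_cell": 1, "file": f"mylu_{i}.wav"} for i in range(2000)
] + [
    {"species": "EPFU", "grid_cell": 2, "file": f"epfu_{i}.wav"} for i in range(300)
]

selected = cap_per_group(records)
train_set, val_set, test_set = random_split(selected)
print(len(selected), len(train_set), len(val_set), len(test_set))
```

In this sketch the large group is cut down to the cap while the small group is kept whole, which matches the "exhausted or reached the threshold" wording in the description.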
