100+ datasets found
  1. SYNERGY - Open machine learning dataset on study selection in systematic...

    • b2find.eudat.eu
    Updated Jul 21, 2024
    + more versions
    Cite
    (2024). SYNERGY - Open machine learning dataset on study selection in systematic reviews - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/1bea4d3c-ceef-5f63-89ed-80aeab18f601
    Explore at:
    Dataset updated
    Jul 21, 2024
    Description

    SYNERGY is a free and open dataset on study selection in systematic reviews, comprising 169,288 academic works from 26 systematic reviews. Only 2,834 (1.67%) of the academic works in the binary classified dataset are included in the systematic reviews. This makes SYNERGY a unique dataset for the development of information retrieval algorithms, especially for sparse labels. Due to the many variables available per record (i.e. titles, abstracts, authors, references, topics), this dataset is useful for researchers in NLP, machine learning, network analysis, and more. In total, the dataset contains 82,668,134 trainable data points. The easiest and recommended way to get SYNERGY is via the "synergy-dataset" Python package, a flexible package that downloads and builds the dataset; see https://github.com/asreview/synergy-dataset for all information.
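
    For readers who want to try the package route, a minimal sketch is shown below. It assumes the command-line entry point and the per-review CSV layout described in the GitHub README; the file name and the "label_included" column are assumptions to verify against the actual build output.

    # Install the package and build the dataset (CLI documented in the repository):
    #   pip install synergy-dataset
    #   synergy get
    # Then load one of the resulting per-review CSV files with pandas.
    import pandas as pd

    # Hypothetical file name; the build step writes one CSV per systematic review.
    df = pd.read_csv("synergy_dataset/van_de_Schoot_2018.csv")
    print(df.shape)
    # Assumed column name for the inclusion label; check the README/header first.
    print(df["label_included"].value_counts())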

  2. Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects...

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO (2023). NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python [Dataset]. http://doi.org/10.6084/m9.figshare.21967265.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts at filtering those projects to curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.

    GitHub page: https://github.com/soarsmu/NICHE
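
    As a quick start, the label file can be inspected with pandas. The snippet below is a sketch: the exact column names in NICHE.csv are assumptions and should be checked against the file header before use.

    import pandas as pd

    # NICHE.csv lists the GitHub projects with their engineered / non-engineered
    # labels and basic statistics (e.g. number of stars and commits).
    df = pd.read_csv("NICHE.csv")

    print(df.shape)               # expected: 572 rows, one per ML project
    print(df.columns.tolist())    # inspect the real column names
    # Hypothetical label column; replace with the actual name from the header.
    # print(df["label"].value_counts())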

  3. Dataset: An Open Combinatorial Diffraction Dataset Including Consensus Human...

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Sep 30, 2025
    + more versions
    Cite
    National Institute of Standards and Technology (2025). Dataset: An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models [Dataset]. https://catalog.data.gov/dataset/dataset-an-open-combinatorial-diffraction-dataset-including-consensus-human-and-machine-le
    Explore at:
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    National Institute of Standards and Technology
    Description

    The open dataset, software, and other files accompanying the manuscript "An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models," submitted for publication to Integrated Materials and Manufacturing Innovations. Machine learning and autonomy are increasingly prevalent in materials science, but existing models are often trained or tuned using idealized data as absolute ground truths. In actual materials science, "ground truth" is often a matter of interpretation and is more readily determined by consensus. Here we present the data, software, and other files for a study using as-obtained diffraction data as a test case for evaluating the performance of machine learning models in the presence of differing expert opinions. We demonstrate that experts with similar backgrounds can disagree greatly even for something as intuitive as using diffraction to identify the start and end of a phase transformation. We then use a logarithmic likelihood method to evaluate the performance of machine learning models in relation to the consensus expert labels and their variance. We further illustrate this method's efficacy in ranking a number of state-of-the-art phase mapping algorithms. We propose a materials data challenge centered around the problem of evaluating models based on consensus with uncertainty. The data, labels, and code used in this study are all available online at data.gov, and the interested reader is encouraged to replicate and improve the existing models or to propose alternative methods for evaluating algorithmic performance.

  4. Data from: Open Images

    • kaggle.com
    • opendatalab.com
    zip
    Updated Feb 12, 2019
    Cite
    Google BigQuery (2019). Open Images [Dataset]. https://www.kaggle.com/datasets/bigquery/open-images
    Explore at:
    zip, 0 bytes (available download formats)
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    Labeled datasets are useful in machine learning research.

    Content

    This public dataset contains approximately 9 million URLs and metadata for images that have been annotated with labels spanning more than 6,000 categories.

    Tables: 1) annotations_bbox 2) dict 3) images 4) labels

    Update Frequency: Quarterly

    Querying BigQuery Tables

    Fork this kernel to get started.
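
    For querying outside of Kaggle, a hedged sketch using the google-cloud-bigquery client is shown below. The dataset and table names follow the table list above, while the column names (display_name, label_name) are assumptions about the schema and should be verified in the BigQuery console first.

    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()  # requires Google Cloud credentials to be configured

    # Find labels whose display name mentions "bus"; column names are assumed.
    query = """
        SELECT display_name, label_name
        FROM `bigquery-public-data.open_images.labels`
        WHERE LOWER(display_name) LIKE '%bus%'
        LIMIT 20
    """
    for row in client.query(query).result():
        print(row.display_name, row.label_name)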

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:open_images

    https://cloud.google.com/bigquery/public-data/openimages

    APA-style citation: Google Research (2016). The Open Images dataset [Image urls and labels]. Available from github: https://github.com/openimages/dataset.

    Use: The annotations are licensed by Google Inc. under CC BY 4.0 license.

    The images referenced in the dataset are listed as having a CC BY 2.0 license. Note: while we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.

    Banner Photo by Mattias Diesel from Unsplash.

    Inspiration

    • Which labels are in the dataset?
    • Which labels have "bus" in their display names?
    • How many images of a trolleybus are in the dataset?
    • What are some landing pages of images with a trolleybus?
    • Which images with cherries are in the training set?

  5. Data from: AppClassNet - A commercial-grade dataset for application...

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    dario rossi (2023). AppClassNet - A commercial-grade dataset for application identification research [Dataset]. http://doi.org/10.6084/m9.figshare.20375580.v1
    Explore at:
    zip (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    dario rossi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    AppClassNet is a commercial-grade dataset that represents a realistic benchmark for the use case of traffic classification and management.

    The AppClassNet dataset is complemented by companion artifacts containing baseline code to train and test state-of-the-art baseline models for a quick bootstrap.

    A description of the dataset, the expected performance of the baseline models, the allowed and forbidden usages of the dataset, and more is available in a companion technical report [1].

    [1] https://dl.acm.org/doi/10.1145/3561954.3561958

  6. A Dataset for Machine Learning Algorithm Development

    • fisheries.noaa.gov
    • catalog.data.gov
    Updated Jan 1, 2021
    Cite
    Alaska Fisheries Science Center (AFSC) (2021). A Dataset for Machine Learning Algorithm Development [Dataset]. https://www.fisheries.noaa.gov/inport/item/63322
    Explore at:
    Dataset updated
    Jan 1, 2021
    Dataset provided by
    Alaska Fisheries Science Center
    Authors
    Alaska Fisheries Science Center (AFSC)
    Area covered
    Chukchi Sea, Alaska, Beaufort Sea, Kotzebue Sound
    Description

    This dataset consists of imagery, imagery footprints, associated ice seal detections and homography files associated with the KAMERA Test Flights conducted in 2019. This dataset was subset to include relevant data for detection algorithm development. This dataset is limited to data collected during flights 4, 5, 6 and 7 from our 2019 surveys.

  7. BUTTER - Empirical Deep Learning Dataset

    • data.openei.org
    • datasets.ai
    • +2more
    code, data, website
    Updated May 20, 2022
    Cite
    Charles Tripp; Jordan Perr-Sauer; Lucas Hayne; Monte Lunacek (2022). BUTTER - Empirical Deep Learning Dataset [Dataset]. http://doi.org/10.25984/1872441
    Explore at:
    code, website, data (available download formats)
    Dataset updated
    May 20, 2022
    Dataset provided by
    Open Energy Data Initiative (OEDI)
    National Renewable Energy Laboratory
    USDOE Office of Energy Efficiency and Renewable Energy (EERE), Multiple Programs (EE)
    Authors
    Charles Tripp; Jordan Perr-Sauer; Lucas Hayne; Monte Lunacek
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The BUTTER Empirical Deep Learning Dataset represents an empirical study of deep learning phenomena on dense fully connected networks, scanning across thirteen datasets, eight network shapes, fourteen depths, twenty-three network sizes (number of trainable parameters), four learning rates, six minibatch sizes, four levels of label noise, and fourteen levels each of L1 and L2 regularization. Multiple repetitions (typically 30, sometimes 10) of each combination of hyperparameters were performed, and statistics including training and test loss (using an 80% / 20% shuffled train-test split) are recorded at the end of each training epoch. In total, this dataset covers 178 thousand distinct hyperparameter settings ("experiments"), 3.55 million individual training runs (an average of 20 repetitions of each experiment), and a total of 13.3 billion training epochs (three thousand epochs were covered by most runs). Accumulating this dataset consumed 5,448.4 CPU core-years, 17.8 GPU-years, and 111.2 node-years.

  8. DataCenter-Traces-Datasets

    • zenodo.org
    bin, csv
    Updated Dec 28, 2024
    Cite
    Alejandro Fernández-Montes; Damián Fernández Cerero (2024). DataCenter-Traces-Datasets [Dataset]. http://doi.org/10.5281/zenodo.14564935
    Explore at:
    bin, csv (available download formats)
    Dataset updated
    Dec 28, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alejandro Fernández-Montes; Damián Fernández Cerero
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Nov 17, 2022
    Description

    Public datasets organized for machine learning or artificial intelligence usage. The following datasets can be used:

    Alibaba 2018 machine usage

    Processed from the original files found at: https://github.com/alibaba/clusterdata/tree/master/cluster-trace-v2018

    This repository dataset of machine usage includes the following columns:

    +-------------------+--------+-------+-----------------------------------------------+
    | Field             | Type   | Label | Comment                                       |
    +-------------------+--------+-------+-----------------------------------------------+
    | cpu_util_percent  | bigint |       | [0, 100]                                      |
    | mem_util_percent  | bigint |       | [0, 100]                                      |
    | net_in            | double |       | normalized incoming network traffic, [0, 100] |
    | net_out           | double |       | normalized outgoing network traffic, [0, 100] |
    | disk_io_percent   | double |       | [0, 100], abnormal values are -1 or 101       |
    +-------------------+--------+-------+-----------------------------------------------+

    Three sampled datasets are provided: the average value of each column grouped every 10 seconds (the original resolution), plus versions downsampled to 30 seconds and 300 seconds. Every column contains the average utilization of the whole data center.

    Google 2019 instance usage

    Processed from the original dataset and queried using Big Query. More information available at: https://research.google/tools/datasets/google-cluster-workload-traces-2019/

    This repository dataset of instance usage includes the following columns:

    +----------------------------+--------+-------+---------+
    | Field                      | Type   | Label | Comment |
    +----------------------------+--------+-------+---------+
    | avg_cpu                    | double |       | [0, 1]  |
    | avg_mem                    | double |       | [0, 1]  |
    | avg_assigned_mem           | double |       | [0, 1]  |
    | avg_cycles_per_instruction | double |       | [0, _]  |
    +----------------------------+--------+-------+---------+

    One sampled dataset is provided: the average value of each column grouped every 300 seconds (the original resolution). Every column contains the average utilization of the whole data center.


    Azure v2 virtual machine workload

    Processed from the original dataset. More information available at: https://github.com/Azure/AzurePublicDataset/blob/master/AzurePublicDatasetV2.md

    This repository dataset of instance usage includes the following columns:

    +---------------+--------+-------+---------+
    | Field         | Type   | Label | Comment |
    +---------------+--------+-------+---------+
    | cpu_usage     | double |       | [0, _]  |
    | assigned_mem  | double |       | [0, _]  |
    +---------------+--------+-------+---------+

    One sampled dataset is provided: the sum of each column grouped every 300 seconds (the original resolution). To compute cpu_usage, we used the core_count usage of each virtual machine. Every column contains the total consumption of the whole data center's virtual machines. There is a version of each file including a timestamp (from 0 to 2591700, in 300-second steps) and another version without timestamps.
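
    As an illustration of how these traces can be consumed, the sketch below loads one of the CSV files with pandas and downsamples it further. The file name is a hypothetical placeholder; check the actual names in the Zenodo record, and note that the column list is the one documented above.

    import pandas as pd

    # Hypothetical file name for the Alibaba 2018 machine-usage trace at 10-second resolution.
    df = pd.read_csv("alibaba2018_machine_usage_10s.csv")
    print(df.columns.tolist())  # expect cpu_util_percent, mem_util_percent, net_in, net_out, disk_io_percent

    # Downsample from 10-second to 5-minute averages, assuming rows are in time
    # order with a fixed 10-second step (30 rows per 5-minute window).
    df_5min = df.groupby(df.index // 30).mean()
    print(df_5min.head())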

    Access Level
    The dataset is freely accessible under an Open Access model. There are no restrictions for reuse, and it is licensed under [Creative Commons Attribution 4.0 (CC-BY 4.0)](https://creativecommons.org/licenses/by/4.0/).

  9. ember

    • huggingface.co
    Updated Jun 19, 2025
    + more versions
    Cite
    SeokHee Chang (2025). ember [Dataset]. https://huggingface.co/datasets/cycloevan/ember
    Explore at:
    Dataset updated
    Jun 19, 2025
    Authors
    SeokHee Chang
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    EMBER Dataset

    EMBER (Elastic Malware Benchmark for Empowering Researchers) is an open dataset for training static PE malware machine learning models.

    References

    • Paper: EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models
    • GitHub: elastic/ember
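
    Since this copy is hosted on the Hugging Face Hub, it can presumably be pulled with the datasets library. The sketch below assumes the default configuration; the split and feature names should be confirmed on the dataset page.

    from datasets import load_dataset  # pip install datasets

    # Load the Hub copy referenced above.
    ds = load_dataset("cycloevan/ember")
    print(ds)  # shows the available splits and features

    # Inspect one record of the first split; feature names are not assumed here.
    first_split = list(ds.keys())[0]
    print(ds[first_split][0].keys())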

  10. Dataset of article: Synthetic Datasets Generator for Testing Information...

    • ieee-dataport.org
    Updated Mar 13, 2020
    + more versions
    Cite
    Carlos Santos (2020). Dataset of article: Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools [Dataset]. https://ieee-dataport.org/open-access/dataset-article-synthetic-datasets-generator-testing-information-visualization-and
    Explore at:
    Dataset updated
    Mar 13, 2020
    Authors
    Carlos Santos
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset used in the article entitled 'Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools'. These datasets can be used to test several characteristics in machine learning and data processing algorithms.

  11. LSD4WSD : An Open Dataset for Wet Snow Detection with SAR Data and Physical...

    • zenodo.org
    bin, pdf +1
    Updated Jul 11, 2024
    Cite
    Matthieu Gallet; Abdourrahmane Atto; Fatima Karbou; Emmanuel Trouvé (2024). LSD4WSD : An Open Dataset for Wet Snow Detection with SAR Data and Physical Labelling [Dataset]. http://doi.org/10.5281/zenodo.10046730
    Explore at:
    text/x-python, bin, pdf (available download formats)
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Matthieu Gallet; Abdourrahmane Atto; Fatima Karbou; Emmanuel Trouvé
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LSD4WSD V2.0

    Learning SAR Dataset for Wet Snow Detection - Full Analysis Version.

    The aim of this dataset is to provide a basis for automatic learning to detect wet snow. It is based on Sentinel-1 SAR GRD satellite images acquired between August 2020 and August 2021 over the French Alps. The new version of this dataset is no longer simply restricted to a classification task, and provides a set of metadata for each sample.

    Modifications and improvements in version 2.0.0:

    • Number of massifs: added 7 new massifs to cover all the Sentinel-1 images (cf. info.pdf).
    • Acquisition: added images of the descending pass in addition to those originally used in the ascending pass.
    • Sample: reduced the size of the samples to 15 by 15 to facilitate evaluation at the central pixel.
    • Sample: increased the density of extracted windows, with a distance of approximately 500 meters between the centers of the windows.
    • Sample: removed the pre-processing involving the use of logarithms.
    • Sample: removed the pre-processing involving normalisation.
    • Labels: new structure for the labels part: a dictionary with the keys topography, metadata and physics.
    • Labels: physics: added direct information from the CROCUS model for 3 simulations: Liquid Water Content, snow height and minimum snowpack temperature.
    • Labels: topography: information on the slope, altitude and average orientation of the sample.
    • Labels: metadata: information on the date of the sample, the mountain massif and the run (ascending or descending).
    • Dataset: removed the train/test split*

    We leave it up to the user to use the Group Kfold method to validate the models using the alpine massif information.

    Finally, it consists of 2,467,516 samples of size 15 by 15 by 9. For each sample, 9 metadata fields are provided, derived in particular from the Crocus physical model:

    • topography:
      • elevation (meters) (average),
      • orientation (degrees) (average),
      • slope (degrees) (average),
    • metadata:
      • name of the alpine massif,
      • date of acquisition,
      • type of acquisition (ascending/descending),
    • physics
      • Liquid Water Content (km/m2),
      • snow height (m),
      • minimum snowpack temperature (Celsius degree).

    The 9 channels are in the following order:

    • Sentinel-1 polarimetric channels: VV, VH and the combination C: VV/VH in linear,
    • Topographical features: altitude, orientation, slope
    • Polarimetric ratio with a reference summer image: VV/VVref, VH/VHref, C/Cref

    * The reference image selected is that of August 9th, 2020, a reference image without snow (cf. Nagler et al.)

    An overview of the distribution and a summary of the sample statistics can be found in the file info.pdf.

    The data is stored in .hdf5 format with gzip compression. We provide a Python script, dataset_load.py, to read and query the data. It is based on the h5py, numpy and pandas libraries and allows selecting part or all of the dataset using requests on the metadata. The script is documented and can be used as described in the README.md file.
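
    For users who prefer to inspect the HDF5 file directly rather than through dataset_load.py, a minimal h5py sketch is given below. The file name and the internal group/dataset names are assumptions; list the keys first and adapt to the actual layout described in the README.md.

    import h5py

    # Hypothetical file name; use the actual .hdf5 file from the Zenodo record.
    with h5py.File("LSD4WSD.hdf5", "r") as f:
        f.visit(print)  # print every group and dataset name in the file
        # Once the real dataset name is known, slices can be read lazily, e.g.:
        # samples = f["samples"][:100]   # hypothetical name; e.g. first 100 windows of 15 x 15 x 9
        # print(samples.shape)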

    The processing chain is available at the following Github address.

    The authors would like to acknowledge the support from the National Centre for Space Studies (CNES) in providing computing facilities and access to SAR images via the PEPS platform.

    The authors would like to deeply thank Mathieu Fructus for running the Crocus simulations.

    Erratum :

    In the dataloader file, the name of the "aquisition" column must be added twice; see the correction below:

    dtst_ld = Dataset_loader(
        path_dataset,
        shuffle=False,
        descrp=["date", "massif", "aquisition", "aquisition", "elevation", "slope", "orientation", "tmin", "hsnow", "tel"],
    )

    If you have any comments, questions or suggestions, please contact the authors:

    • matthieu.gallet@univ-smb.fr
    • fatima.karbou@meteo.fr
    • abdourrahmane.atto@univ-smb.fr
    • emmanuel.trouve@univ-smb.fr

  12. Machine learning code and best models.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated Apr 17, 2024
    Cite
    Qingxin Yang; Li Luo; Zhangpeng Lin; Wei Wen; Wenbo Zeng; Hong Deng (2024). Machine learning code and best models. [Dataset]. http://doi.org/10.1371/journal.pone.0300662.s002
    Explore at:
    zip (available download formats)
    Dataset updated
    Apr 17, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Qingxin Yang; Li Luo; Zhangpeng Lin; Wei Wen; Wenbo Zeng; Hong Deng
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    They are available at https://github.com/nerdyqx/ML. (ZIP)

  13. AI Training Dataset Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). AI Training Dataset Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-ai-training-dataset-market
    Explore at:
    csv, pptx, pdf (available download formats)
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    AI Training Dataset Market Outlook



    The global AI training dataset market size was valued at approximately USD 1.2 billion in 2023 and is projected to reach USD 6.5 billion by 2032, growing at a compound annual growth rate (CAGR) of 20.5% from 2024 to 2032. This substantial growth is driven by the increasing adoption of artificial intelligence across various industries, the necessity for large-scale and high-quality datasets to train AI models, and the ongoing advancements in AI and machine learning technologies.



    One of the primary growth factors in the AI training dataset market is the exponential increase in data generation across multiple sectors. With the proliferation of internet usage, the expansion of IoT devices, and the digitalization of industries, there is an unprecedented volume of data being generated daily. This data is invaluable for training AI models, enabling them to learn and make more accurate predictions and decisions. Moreover, the need for diverse and comprehensive datasets to improve AI accuracy and reliability is further propelling market growth.



    Another significant factor driving the market is the rising investment in AI and machine learning by both public and private sectors. Governments around the world are recognizing the potential of AI to transform economies and improve public services, leading to increased funding for AI research and development. Simultaneously, private enterprises are investing heavily in AI technologies to gain a competitive edge, enhance operational efficiency, and innovate new products and services. These investments necessitate high-quality training datasets, thereby boosting the market.



    The proliferation of AI applications in various industries, such as healthcare, automotive, retail, and finance, is also a major contributor to the growth of the AI training dataset market. In healthcare, AI is being used for predictive analytics, personalized medicine, and diagnostic automation, all of which require extensive datasets for training. The automotive industry leverages AI for autonomous driving and vehicle safety systems, while the retail sector uses AI for personalized shopping experiences and inventory management. In finance, AI assists in fraud detection and risk management. The diverse applications across these sectors underline the critical need for robust AI training datasets.



    As the demand for AI applications continues to grow, the role of AI Data Resource Services becomes increasingly vital. These services provide the necessary infrastructure and tools to manage, curate, and distribute datasets efficiently. By leveraging AI Data Resource Services, organizations can ensure that their AI models are trained on high-quality and relevant data, which is crucial for achieving accurate and reliable outcomes. The service acts as a bridge between raw data and AI applications, streamlining the process of data acquisition, annotation, and validation. This not only enhances the performance of AI systems but also accelerates the development cycle, enabling faster deployment of AI-driven solutions across various sectors.



    Regionally, North America currently dominates the AI training dataset market due to the presence of major technology companies and extensive R&D activities in the region. However, Asia Pacific is expected to witness the highest growth rate during the forecast period, driven by rapid technological advancements, increasing investments in AI, and the growing adoption of AI technologies across various industries in countries like China, India, and Japan. Europe and Latin America are also anticipated to experience significant growth, supported by favorable government policies and the increasing use of AI in various sectors.



    Data Type Analysis



    The data type segment of the AI training dataset market encompasses text, image, audio, video, and others. Each data type plays a crucial role in training different types of AI models, and the demand for specific data types varies based on the application. Text data is extensively used in natural language processing (NLP) applications such as chatbots, sentiment analysis, and language translation. As the use of NLP is becoming more widespread, the demand for high-quality text datasets is continually rising. Companies are investing in curated text datasets that encompass diverse languages and dialects to improve the accuracy and efficiency of NLP models.



    Image data is critical for computer vision application

  14. KU-HAR: An Open Dataset for Human Activity Recognition

    • data.mendeley.com
    Updated Feb 16, 2021
    + more versions
    Cite
    Abdullah-Al Nahid (2021). KU-HAR: An Open Dataset for Human Activity Recognition [Dataset]. http://doi.org/10.17632/45f952y38r.5
    Explore at:
    Dataset updated
    Feb 16, 2021
    Authors
    Abdullah-Al Nahid
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    (Always use the latest version of the dataset.)

    Human Activity Recognition (HAR) refers to the capacity of machines to perceive human actions. This dataset contains information on 18 different activities collected from 90 participants (75 male and 15 female) using smartphone sensors (Accelerometer and Gyroscope). It has 1945 raw activity samples collected directly from the participants, and 20750 subsamples extracted from them. The activities are:

    • Stand ➞ Standing still (1 min)
    • Sit ➞ Sitting still (1 min)
    • Talk-sit ➞ Talking with hand movements while sitting (1 min)
    • Talk-stand ➞ Talking with hand movements while standing or walking (1 min)
    • Stand-sit ➞ Repeatedly standing up and sitting down (5 times)
    • Lay ➞ Laying still (1 min)
    • Lay-stand ➞ Repeatedly standing up and laying down (5 times)
    • Pick ➞ Picking up an object from the floor (10 times)
    • Jump ➞ Jumping repeatedly (10 times)
    • Push-up ➞ Performing full push-ups (5 times)
    • Sit-up ➞ Performing sit-ups (5 times)
    • Walk ➞ Walking 20 meters (≈12 s)
    • Walk-backward ➞ Walking backward for 20 meters (≈20 s)
    • Walk-circle ➞ Walking along a circular path (≈20 s)
    • Run ➞ Running 20 meters (≈7 s)
    • Stair-up ➞ Ascending on a set of stairs (≈1 min)
    • Stair-down ➞ Descending from a set of stairs (≈50 s)
    • Table-tennis ➞ Playing table tennis (1 min)

    Contents of the attached .zip files are:

    1.Raw_time_domian_data.zip ➞ Originally collected 1945 time-domain samples in separate .csv files. The arrangement of information in each .csv file is:
    • Column 1, 5 ➞ exact time (elapsed since the start) when the Accelerometer & Gyro output was recorded (in ms)
    • Col. 2, 3, 4 ➞ Acceleration along X, Y, Z axes (in m/s^2)
    • Col. 6, 7, 8 ➞ Rate of rotation around X, Y, Z axes (in rad/s)

    2.Trimmed_interpolated_raw_data.zip➞ Unnecessary parts of the samples were trimmed (only from the beginning and the end). The samples were interpolated to keep a constant sampling rate of 100 Hz. The arrangement of information is the same as above.

    3.Time_domain_subsamples.zip ➞ 20750 subsamples extracted from the 1945 collected samples, provided in a single .csv file. Each of them contains 3 seconds of non-overlapping data of the corresponding activity. Arrangement of information:
    • Col. 1–300, 301–600, 601–900 ➞ Accelerometer X, Y, Z axes readings
    • Col. 901–1200, 1201–1500, 1501–1800 ➞ Gyro X, Y, Z axes readings
    • Col. 1801 ➞ Class ID (0 to 17, in the order mentioned above)
    • Col. 1802 ➞ length of each channel's data in the subsample
    • Col. 1803 ➞ serial no. of the subsample
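
    To make the column layout above concrete, the sketch below loads the subsample file with pandas and reshapes one row into per-axis signals. The CSV name inside the zip and the absence of a header row are assumptions to check after extraction.

    import pandas as pd

    # Hypothetical CSV name extracted from 3.Time_domain_subsamples.zip
    data = pd.read_csv("Time_domain_subsamples.csv", header=None).to_numpy()

    row = data[0]
    acc = row[0:900].reshape(3, 300)      # accelerometer X, Y, Z (300 samples each, 3 s at 100 Hz)
    gyro = row[900:1800].reshape(3, 300)  # gyroscope X, Y, Z
    class_id = int(row[1800])             # activity class ID, 0 to 17 in the order listed above
    print(acc.shape, gyro.shape, class_id)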

    Gravity acceleration was omitted from the Acc.meter data, and no filter was applied to remove noise. The dataset is free to download, modify, and use.

    More information is provided in the data paper which is currently under review: N. Sikder, A.-A. Nahid, KU-HAR: An open dataset for heterogeneous human activity recognition, Pattern Recognit. Lett. (submitted).

    A preprint will be available soon.

    Backup: drive.google.com/drive/folders/1yrG8pwq3XMlyEGYMnM-8xnrd6js0oXA7

  15. Data from: Potrika: Raw and Balanced Newspaper Datasets in the Bangla...

    • data.mendeley.com
    Updated Nov 7, 2022
    + more versions
    Cite
    Istiak Ahmad (2022). Potrika: Raw and Balanced Newspaper Datasets in the Bangla Language with Eight Topics and Five Attributes [Dataset]. http://doi.org/10.17632/v362rp78dc.4
    Explore at:
    Dataset updated
    Nov 7, 2022
    Authors
    Istiak Ahmad
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Knowledge is central to human and scientific developments. Natural Language Processing (NLP) allows automated analysis and creation of knowledge. Data is a crucial NLP and machine learning ingredient. The scarcity of open datasets is a well-known problem in machine and deep learning research. This is very much the case for textual NLP datasets in English and other major world languages. For the Bangla language, the situation is even more challenging and the number of large datasets for NLP research is practically nil. We hereby present Potrika, a large single-label Bangla news article textual dataset curated for NLP research from six popular online news portals in Bangladesh (Jugantor, Jaijaidin, Ittefaq, Kaler Kontho, Inqilab, and Somoyer Alo) for the period 2014-2020. The articles are classified into eight distinct categories (National, Sports, International, Entertainment, Economy, Education, Politics, and Science & Technology) providing five attributes (News Article, Category, Headline, Publication Date, and Newspaper Source). The raw dataset contains 185.51 million words and 12.57 million sentences contained in 664,880 news articles. Moreover, using NLP augmentation techniques, we create from the raw (unbalanced) dataset another (balanced) dataset comprising 320,000 news articles with 40,000 articles in each of the eight news categories. Potrika contains both datasets (raw and balanced) to suit a wide range of NLP research. To the best of our knowledge, Potrika is by far the largest and the most extensive dataset for news classification.

    Further details of the dataset, its collection, and usage can be found in our article here: https://doi.org/10.48550/arXiv.2210.09389.

  16. Data from: MLOmics: Cancer Multi-Omics Database for Machine Learning

    • figshare.com
    bin
    Updated May 25, 2025
    + more versions
    Cite
    Rikuto Kotoge (2025). MLOmics: Cancer Multi-Omics Database for Machine Learning [Dataset]. http://doi.org/10.6084/m9.figshare.28729127.v2
    Explore at:
    bin (available download formats)
    Dataset updated
    May 25, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Rikuto Kotoge
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Framing the investigation of diverse cancers as a machine learning problem has recently shown significant potential in multi-omics analysis and cancer research. Empowering these successful machine learning models are high-quality training datasets with sufficient data volume and adequate preprocessing. However, while there exist several public data portals, including The Cancer Genome Atlas (TCGA) multi-omics initiative, and open databases such as LinkedOmics, these databases are not off-the-shelf for existing machine learning models. We propose MLOmics, an open cancer multi-omics database aiming to better serve the development and evaluation of bioinformatics and machine learning models. MLOmics contains 8,314 patient samples covering all 32 cancer types with four omics types, stratified features, and extensive baselines. Complementary support for downstream analysis and bio-knowledge linking is also included to support interdisciplinary analysis.

  17. Malware Analysis Datasets: Top-1000 PE Imports

    • ieee-dataport.org
    Updated Nov 8, 2019
    + more versions
    Cite
    Angelo Oliveira (2019). Malware Analysis Datasets: Top-1000 PE Imports [Dataset]. https://ieee-dataport.org/open-access/malware-analysis-datasets-top-1000-pe-imports
    Explore at:
    Dataset updated
    Nov 8, 2019
    Authors
    Angelo Oliveira
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of my PhD research on malware detection and classification using Deep Learning. It contains static analysis data: Top-1000 imported functions extracted from the 'pe_imports' elements of Cuckoo Sandbox reports. PE malware examples were downloaded from virusshare.com. PE goodware examples were downloaded from portableapps.com and from Windows 7 x86 directories.

  18. Machine learning model that estimates public-supply deliveries for domestic...

    • catalog.data.gov
    • data.usgs.gov
    • +2more
    Updated Oct 1, 2025
    Cite
    U.S. Geological Survey (2025). Machine learning model that estimates public-supply deliveries for domestic and other use types [Dataset]. https://catalog.data.gov/dataset/machine-learning-model-that-estimates-public-supply-deliveries-for-domestic-and-other-use-
    Explore at:
    Dataset updated
    Oct 1, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This child item describes a public-supply delivery machine learning model that was developed to estimate public-supply deliveries. Publicly supplied water may be delivered to domestic users or to commercial, industrial, institutional, and irrigation (CII) users. This model predicts total, domestic, and CII per capita rates for public-supply water service areas within the conterminous United States for 2009-2020. This child item contains model input datasets, code used to build the delivery machine learning model, and national predictions. This dataset is part of a larger data release using machine learning to predict public-supply water use for 12-digit hydrologic units from 2000-2020. This page includes the following file: delivery_water_use_model.zip - a zip file containing input datasets, scripts, and output datasets for the delivery water use machine learning model

  19. Zenodo Open Metadata snapshot - Training dataset for records and communities...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin
    Updated Dec 15, 2022
    Cite
    Zenodo team (2022). Zenodo Open Metadata snapshot - Training dataset for records and communities classifier building [Dataset]. http://doi.org/10.5281/zenodo.7438358
    Explore at:
    bin, application/gzip (available download formats)
    Dataset updated
    Dec 15, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Zenodo team
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains Zenodo's published open access records and communities metadata, including entries marked by the Zenodo staff as spam and deleted.

    The datasets are gzip-compressed JSON-lines files, where each line is a JSON object representation of a Zenodo record or community.

    Records dataset

    Filename: zenodo_open_metadata_{ date of export }.jsonl.gz

    Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date

    which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.

    In addition, some terms have been altered:

    • The term files contains a list of dictionaries containing filetype, size, and filename only.
    • The term license contains a short Zenodo ID of the license (e.g. "cc-by").

    Communities dataset

    Filename: zenodo_community_metadata_{ date of export }.jsonl.gz

    Each object contains the terms: id, title, description, curation_policy, page

    which correspond to the fields with the same name available in Zenodo's community creation form.

    Notes for all datasets

    For each object the term spam contains a boolean value, determining whether a given record/community was marked as spam content by Zenodo staff.

    Some top-level terms that were missing in the metadata may contain a null value.

    A smaller uncompressed random sample of 200 JSON lines is also included for each dataset to test and get familiar with the format without having to download the entire dataset.
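
    A short sketch for streaming the records file is shown below. The file name follows the templated pattern given above with a placeholder for the export date; substitute the date of the snapshot you downloaded.

    import gzip
    import json

    # Placeholder file name; fill in the export date from the downloaded snapshot.
    path = "zenodo_open_metadata_<date of export>.jsonl.gz"

    n_records = 0
    n_spam = 0
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            n_records += 1
            if record.get("spam"):
                n_spam += 1
    print(n_records, "records,", n_spam, "marked as spam")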

  20. Big Data Machine Learning Benchmark on Spark

    • ieee-dataport.org
    Updated Jun 6, 2019
    Cite
    Jairson Rodrigues (2019). Big Data Machine Learning Benchmark on Spark [Dataset]. https://ieee-dataport.org/open-access/big-data-machine-learning-benchmark-spark
    Explore at:
    Dataset updated
    Jun 6, 2019
    Authors
    Jairson Rodrigues
    Description

    net traffic
